Nonparametric estimation of the stationary density and the transition density of a Markov chain



Claire Lacour 

MAP5, Université Paris Descartes, CNRS UMR 8145, 45 rue des Saints-Pères,
75270 Paris Cedex 06, France
lacour@math-info.univ-paris5.fr



Abstract 

In this paper, we first study the problem of nonparametric estimation of the
stationary density $f$ of a discrete-time Markov chain $(X_i)$. We consider a
collection of projection estimators on finite-dimensional linear spaces. We select
an estimator among the collection by minimizing a penalized contrast. The same
technique makes it possible to estimate the density $g$ of $(X_i, X_{i+1})$ and so
to provide an adaptive estimator of the transition density $\pi = g/f$. We give
bounds in $L^2$ norm for these estimators and we show that they are adaptive in the
minimax sense over a large class of Besov spaces. Some examples and simulations are
also provided.

Key words: Adaptive estimation, Markov Chain, Stationary density, Transition 
density, Model selection, Penalized contrast, Projection estimators 



1 Introduction 



Nonparametric estimation is now a very rich branch of statistical theory. The
case of i.i.d. observations is the most detailed, but many authors are also
interested in the case of Markov processes. Early results are stated by Roussas
(1969), who studies nonparametric estimators of the stationary density and
the transition density of a Markov chain. He considers kernel estimators and
assumes that the chain satisfies the strong Doeblin's condition $(D_0)$ (see Doob
(1953) p. 221). He shows consistency and asymptotic normality of his estimator.
Several authors tried to consider weaker assumptions than Doeblin's
condition. Rosenblatt (1970) introduces another condition, denoted by $(G_2)$,
and gives results on the bias and the variance of the kernel estimator of
the invariant density in this weaker framework. Yakowitz (1989) also improves
the result of asymptotic normality by considering a Harris condition.
The study of kernel estimators is completed by Masry and Györfi (1987), who
find sharp rates for this kind of estimator of the stationary density, and by
Basu and Sahoo (1998), who prove a Berry-Esseen inequality under the
condition $(G_2)$ of Rosenblatt. Other authors are interested in the estimation of
the invariant distribution and the transition density in the non-stationary
case: Doukhan and Ghindès (1983) bound the integrated risks for any initial
distribution. In Hernández-Lerma et al. (1988), recursive estimators for a
non-stationary Markov chain are described. Liebscher (1992) gives results for the
invariant density in this non-stationary framework using a condition denoted
by $(D_1)$, derived from Doeblin's condition but weaker than $(D_0)$. All the
above papers deal with kernel estimators. Among those who are not interested
in such estimators, let us mention Bosq (1973), who studies an estimator of the
stationary density by projection on a Fourier basis, Prakasa Rao (1978), who
outlines a new estimator for the stationary density by using delta-sequences,
and Gillert and Wartenberg (1984), who present estimators based on Hermite
bases or trigonometric bases.



The recent work of Clémençon (1999) makes it possible to measure the performance
of all these estimators, since he proves lower bounds for the minimax rates and
thus gives the optimal convergence rates for the estimation of the stationary
density and the transition density. Clémençon also provides another kind of
estimator for the stationary density and for the transition density, which he
obtains by projection on wavelet bases. He presents an adaptive procedure
which is "quasi-optimal" in the sense that the procedure reaches almost the
optimal rate but with a logarithmic loss. He needs other conditions than those
we cited above, in particular a minoration condition derived from
Nummelin's (1984) works. In this paper, we will use the same condition.



The aim of this paper is to estimate the stationary density of a discrete-time
Markov chain and its transition density. We consider an irreducible positive
recurrent Markov chain $(X_n)$ with a stationary density denoted by $f$. We
suppose that the initial density is $f$ (hence the process is stationary) and
we construct an estimator $\tilde f$ from the data $X_1, \dots, X_n$. Then, we study
the mean integrated squared error $E\|\tilde f - f\|_2^2$ and its convergence rate.
The same technique makes it possible to estimate the density $g$ of
$(X_i, X_{i+1})$ and so to provide an estimator of the transition density
$\pi = g/f$, called the quotient estimator.



An adaptive procedure is proposed for the two estimations and it is proved
that both resulting estimators reach the optimal minimax rates without any
additional logarithmic factor.



We will use here some technical methods known as the Nummelin splitting
technique (see Nummelin (1984), Meyn and Tweedie (1993) or Höpfner and Löcherbach
(2003)). This method makes it possible to reduce the general state space Markov
chain theory to the countable space theory. Actually, the splitting of the original
chain creates an artificial accessible atom and we will use the hitting times
to this atom to decompose the chain, as we would have done for a countable
space chain.



To build our estimator of $f$, we use model selection via penalization as
described in Barron et al. (1999). First, estimators by projection, denoted by
$\hat f_m$, are considered. The index $m$ denotes the model, i.e. the subspace to
which the estimator belongs. Then the model selection technique selects
automatically an estimator $\hat f_{\hat m}$ from the collection of estimators
$(\hat f_m)$. The estimator of $g$ is built in the same way. The collections of
models that we consider here include wavelets but also trigonometric polynomials
and piecewise polynomials.



This paper is organized as follows. In Section 2, we present our assumptions
on the Markov chain and on the collections of models. We also give examples
of chains and models. Section 3 is devoted to the estimation of the stationary
density, and in Section 4 the estimation of the transition density is explained.
Some simulations are presented in Section 5. The proofs are gathered in the
last section, which also contains a presentation of the Nummelin splitting
technique.



2 The framework 



2.1 Assumptions on the Markov chain 



We consider an irreducible Markov chain $(X_n)$ taking its values in the real
line $\mathbb{R}$. We suppose that $(X_n)$ is positive recurrent, i.e. it admits a
stationary probability measure $\mu$ (for more details, we refer to Meyn and
Tweedie (1993)). We assume that the distribution $\mu$ has a density $f$ with
respect to the Lebesgue measure and it is this quantity that we want to estimate.
Since the number of observations is finite, $f$ is estimated on a compact set only.
Without loss of generality, this compact set is assumed to be equal to $[0,1]$
and, from now on, $f$ denotes the stationary density multiplied by the indicator
function of $[0,1]$: $f\,\mathbb{1}_{[0,1]}$. More precisely, the Markov process is
supposed to satisfy the following assumptions:

A1. $(X_n)$ is irreducible and positive recurrent.

A2. The distribution of $X_0$ is equal to $\mu$, thus the chain is (strictly) stationary.

A3. The stationary density $f$ belongs to $L^\infty([0,1])$, i.e. $\sup_{x \in [0,1]} |f(x)| < \infty$.

A4. The chain is strongly aperiodic, i.e. it satisfies the following minorization
condition: there is some function $h : [0,1] \to [0,1]$ with $\int h \, d\mu > 0$ and a
positive distribution $\nu$ such that, for all events $A$ and for all $x$,

$$P(x, A) \geq h(x)\nu(A)$$

where $P$ is the transition kernel of $(X_n)$.

A5. The chain is geometrically ergodic, i.e. there exists a function $V > 0$ finite
and a constant $\rho \in (0,1)$ such that, for all $n \geq 1$,

$$\|P^n(x, \cdot) - \mu\|_{TV} \leq V(x)\rho^n$$

where $\|\cdot\|_{TV}$ is the total variation norm.
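To give some intuition for the geometric ergodicity assumption A5, here is a minimal sketch on a toy two-state chain, which is not one of the chains studied in this paper. The transition probabilities `p` and `q` are hypothetical, and the total variation distance is taken as half the $L^1$ distance between the two laws (one common convention, assumed here). For this toy chain the distance decays exactly like $\rho^n$ with $\rho = |1 - p - q|$.

```python
def two_state_tv_distances(p, q, steps):
    """Total variation distance between P^n(0, .) and the stationary law, n = 1..steps.

    Toy chain on {0, 1} with P(0, 1) = p and P(1, 0) = q (hypothetical example).
    The distance is computed as half the L1 distance between the two laws.
    """
    mu = (q / (p + q), p / (p + q))   # stationary distribution of the toy chain
    row = (1.0, 0.0)                  # law of X_0 when the chain starts at state 0
    distances = []
    for _ in range(steps):
        # One step of the chain: row <- row * P
        row = (row[0] * (1 - p) + row[1] * q, row[0] * p + row[1] * (1 - q))
        distances.append(0.5 * (abs(row[0] - mu[0]) + abs(row[1] - mu[1])))
    return distances


if __name__ == "__main__":
    p, q = 0.3, 0.2
    rho = abs(1 - p - q)
    for n, d in enumerate(two_state_tv_distances(p, q, 5), start=1):
        # Second printed column: closed form (p / (p + q)) * rho**n
        print(n, d, (p / (p + q)) * rho ** n)
```

Here the role of $V(x)$ in A5 is played by the constant $p/(p+q)$ for the starting state 0.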

We can remark that condition A3 implies that $f$ belongs to $L^2([0,1])$, where

$$L^2([0,1]) = \Big\{t : \mathbb{R} \to \mathbb{R},\ \mathrm{Supp}(t) \subset [0,1] \text{ and } \|t\|^2 = \int_0^1 t^2(x)\,dx < \infty\Big\}.$$

Notice that, if the chain is aperiodic, condition A4 holds, at least for some
$m$-skeleton (i.e. a chain with transition probability $P^m$) (see Theorem 5.2.2 in
Meyn and Tweedie (1993)). This minorization condition is used in the Nummelin
splitting technique and is also required in Clémençon (1999).



The last assumption, which is called geometric regularity by Clémençon (2000),
means that the convergence of the chain to the invariant distribution is
geometrically fast. In Meyn and Tweedie (1993), we find a slightly different
condition (replacing the total variation norm by the $V$-norm). This condition,
which is sufficient for A5, is widely used in the Markov chain Monte Carlo
literature because it guarantees central limit theorems and makes it possible to
simulate laws via a Markov chain (see for example Jarner and Hansen (2000),
Roberts and Rosenthal (1998) or Meyn and Tweedie (1994)).



The following subsection gives some examples of Markov chains satisfying 
hypotheses A1-A5. 



2.2 Examples of chains 



2.2.1 Diffusion processes

We consider the process $(X_{i\Delta})_{1 \leq i \leq n}$ where $\Delta > 0$ is the observation step and
$(X_t)_{t \geq 0}$ is defined by

$$dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t$$

where $W$ is the standard Brownian motion, $b$ is a locally bounded Borel
function and $\sigma$ is a uniformly continuous function such that:

(1) there exist $\lambda_-, \lambda_+$ such that $\forall x \neq 0$, $0 < \lambda_- < \sigma^2(x) < \lambda_+$,

(2) there exist $M_0$, $\alpha > 0$ and $r > 0$ such that $\forall |x| \geq M_0$, $x\,b(x) \leq -r|x|^{\alpha+1}$.

Then, if $X_0$ follows the stationary distribution, Proposition 1 in Pardoux and Veretennikov
(2001) shows that the discretized process $(X_{i\Delta})_{1 \leq i \leq n}$ satisfies Assumptions
A1-A5.



2.2.2 Nonlinear AR(1) processes

Let us consider the following process

$$X_n = \varphi(X_{n-1}) + \varepsilon_{X_{n-1},n}$$

where $\varepsilon_{x,n}$ has a positive density $l_x$ with respect to the Lebesgue measure,
which does not depend on $n$. We suppose that $\varphi$ is bounded on any compact set
and that there exist $M > 0$ and $\rho < 1$ such that, for all $|x| > M$, $|\varphi(x)| < \rho|x|$.

Mokkadem (1987) proves that if there exists $s > 0$ such that $\sup_x E|\varepsilon_{x,n}|^s < \infty$,
then the chain is geometrically ergodic. If we assume furthermore that $l_x$ has
a lower bound, then the chain satisfies all the previous assumptions.



2.2.3 ARX(1,1) models

The nonlinear process ARX(1,1) is defined by

$$X_n = F(X_{n-1}, Z_n) + \varepsilon_n$$

where $F$ is bounded and $(\varepsilon_n)$, $(Z_n)$ are independent sequences of i.i.d. random
variables with $E|\varepsilon_n| < \infty$. We suppose that the distribution of $Z_n$ has a positive
density with respect to the Lebesgue measure. Assume that there exist $\rho < 1$,
a locally bounded and measurable function $h : \mathbb{R} \to \mathbb{R}^+$ such that $E h(Z_n) < \infty$,
and positive constants $M$, $c$ such that

$$\forall |(u,v)| > M, \quad |F(u,v)| < \rho|u| + h(v) - c \quad \text{and} \quad \sup_{|x| \leq M} |F(x)| < \infty.$$

Then the process $(X_n)$ satisfies Assumptions A1-A5 (see Doukhan (1994)
p. 102).



2.2.4 ARCH process

The considered model is

$$X_{n+1} = F(X_n) + G(X_n)\varepsilon_{n+1}$$

where $F$ and $G$ are continuous functions and, for all $x$, $G(x) \neq 0$. We suppose
that the distribution of $\varepsilon_n$ has a positive and continuous density with respect
to the Lebesgue measure and that there exists $s \geq 1$ such that $E|\varepsilon_n|^s < \infty$.
The chain $(X_i)$ satisfies Assumptions A1-A5 if (see Doukhan (1994) p. 106):

$$\limsup_{|x| \to \infty} \frac{|F(x)| + |G(x)|\,(E|\varepsilon_n|^s)^{1/s}}{|x|} < 1.$$
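As a quick sanity check of this criterion, one can take the ARCH chain simulated in Section 5, with $F = \sin$, $G = \cos + 3$ and standard Gaussian noise. The choice $s = 2$ below is ours, made only so that $(E|\varepsilon_n|^2)^{1/2} = 1$:

```latex
\frac{|\sin(x)| + |\cos(x) + 3|\,(E|\varepsilon_n|^2)^{1/2}}{|x|}
  \leq \frac{1 + 4 \cdot 1}{|x|} = \frac{5}{|x|}
  \xrightarrow[|x| \to \infty]{} 0 < 1 .
```

Since moreover $\cos(x) + 3 \geq 2 > 0$, the requirement $G(x) \neq 0$ holds as well, so this chain fits the framework under the stated conditions.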



2.3 Assumptions on the models 



In order to estimate $f$, we need to introduce some collections of models. The
assumptions on the models are the following:

M1. Each $S_m$ is a linear subspace of $(L^\infty \cap L^2)([0,1])$ with dimension $D_m \leq \sqrt{n}$.

M2. Let

$$\phi_m = \frac{1}{\sqrt{D_m}} \sup_{t \in S_m \setminus \{0\}} \frac{\|t\|_\infty}{\|t\|}.$$

There exists a real $r_0$ such that, for all $m$, $\phi_m \leq r_0$.

This assumption ($L^2$-$L^\infty$ connection) is introduced by Barron et al. (1999) and
can be written:

$$\forall t \in S_m, \quad \|t\|_\infty \leq r_0 \sqrt{D_m}\,\|t\|. \qquad (1)$$

We then get a set of models $(S_m)_{m \in \mathcal{M}_n}$ where $\mathcal{M}_n = \{m,\ D_m \leq \sqrt{n}\}$. We
now need a last assumption regarding the whole collection, which ensures that,
for $m$ and $m'$ in $\mathcal{M}_n$, $S_m + S_{m'}$ belongs to the collection of models.

M3. The models are nested, that is, for all $m$, $m'$, $D_m \leq D_{m'} \Rightarrow S_m \subset S_{m'}$.



2.4 Examples of models 



We show here that the assumptions M1-M3 are not too restrictive. Indeed, they
are verified for the models spanned by the following bases (see Barron et al.
(1999)):

Histogram basis: $S_m = \langle \varphi_1, \dots, \varphi_{2^m} \rangle$ with $\varphi_j = 2^{m/2}\,\mathbb{1}_{[\frac{j-1}{2^m}, \frac{j}{2^m}[}$ for
$j = 1, \dots, 2^m$. Here $D_m = 2^m$, $r_0 = 1$ and $\mathcal{M}_n = \{1, \dots, \lfloor \ln n / (2 \ln 2) \rfloor\}$, where
$\lfloor x \rfloor$ denotes the floor of $x$, i.e. the largest integer less than or equal to $x$.

Trigonometric basis: $S_m = \langle \varphi_0, \dots, \varphi_{m-1} \rangle$ with $\varphi_0(x) = \mathbb{1}_{[0,1]}(x)$,
$\varphi_{2j}(x) = \sqrt{2}\cos(2\pi j x)\,\mathbb{1}_{[0,1]}(x)$, $\varphi_{2j-1}(x) = \sqrt{2}\sin(2\pi j x)\,\mathbb{1}_{[0,1]}(x)$ for $j \geq 1$.
For this model $D_m = m$ and $r_0 = \sqrt{2}$ hold.

Regular piecewise polynomial basis: $S_m$ is spanned by polynomials of degree
$0, \dots, r$ (where $r$ is fixed) on each interval $[(j-1)/2^D, j/2^D[$, $j = 1, \dots, 2^D$.
In this case, $m = (D, r)$, $D_m = (r+1)2^D$ and
$\mathcal{M}_n = \{(D, r),\ D = 1, \dots, \lfloor \log_2(\sqrt{n}/(r+1)) \rfloor\}$. We can put $r_0 = \sqrt{r+1}$.

Regular wavelet basis: $S_m = \langle \psi_{jk},\ j = -1, \dots, m,\ k \in \Lambda(j) \rangle$ where $\psi_{-1,k}$
denotes the translates of the father wavelet and $\psi_{jk}(x) = 2^{j/2}\psi(2^j x - k)$,
where $\psi$ is the mother wavelet. We assume that the support of the wavelets is
included in $[0,1]$ and that $\psi_{-1} = \varphi$ belongs to the Sobolev space $W_2^r$. In this
framework $\Lambda(j) = \{0, \dots, K 2^j - 1\}$ (for $j \geq 0$), where $K$ is a constant which
depends on the supports of $\varphi$ and $\psi$: for example, for the Haar basis $K = 1$.
We then have $D_m = \sum_{j=-1}^m |\Lambda(j)| = |\Lambda(-1)| + K(2^{m+1} - 1)$. Moreover

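$$\phi_m \leq \frac{\big\|\sum_k |\psi_{-1,k}|\big\|_\infty + \sum_{j=0}^m 2^{j/2} \big\|\sum_k |\psi_{j,k}|\big\|_\infty}{\sqrt{D_m}}
\leq \frac{\|\varphi\|_\infty \vee \|\psi\|_\infty \,\big(1 + \sum_{j=0}^m 2^{j/2}\big)}{\sqrt{(K \wedge |\Lambda(-1)|)\,2^{m+1}}}
\leq \frac{\|\varphi\|_\infty \vee \|\psi\|_\infty}{\sqrt{K \wedge |\Lambda(-1)|}}.$$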
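To make the first two examples concrete, the sketch below builds the histogram and trigonometric bases in plain Python and checks numerically that they are orthonormal and compatible with the norm connection (1). The function names, the midpoint quadrature and the grid size are our own illustrative choices, not part of the paper.

```python
import math


def histogram_basis(m):
    """Return the 2**m functions phi_j = 2**(m/2) * indicator of [j/2**m, (j+1)/2**m)."""
    size = 2 ** m

    def make(j):
        lo, hi = j / size, (j + 1) / size
        return lambda x: math.sqrt(size) if lo <= x < hi else 0.0

    return [make(j) for j in range(size)]


def trigonometric_basis(m):
    """Return phi_0, ..., phi_{m-1}: the constant 1, then sqrt(2)*sin / sqrt(2)*cos pairs."""
    def make(k):
        if k == 0:
            return lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0
        j = (k + 1) // 2
        wave = math.sin if k % 2 == 1 else math.cos   # odd index: sine, even index: cosine
        return lambda x: math.sqrt(2.0) * wave(2.0 * math.pi * j * x) if 0.0 <= x <= 1.0 else 0.0

    return [make(k) for k in range(m)]


def inner(u, v, grid=2000):
    """Midpoint-rule approximation of the L2([0, 1]) inner product of u and v."""
    h = 1.0 / grid
    return sum(u((i + 0.5) * h) * v((i + 0.5) * h) for i in range(grid)) * h


def sup_ratio(basis, coeffs, grid=2000):
    """Approximate ||t||_inf / (sqrt(D) * ||t||) for t = sum_k coeffs[k] * basis[k].

    The L2 norm is computed from the coefficients, which assumes an orthonormal basis.
    """
    def t(x):
        return sum(c * phi(x) for c, phi in zip(coeffs, basis))

    sup = max(abs(t((i + 0.5) / grid)) for i in range(grid))
    norm = math.sqrt(sum(c * c for c in coeffs))
    return sup / (math.sqrt(len(basis)) * norm)


if __name__ == "__main__":
    print(sup_ratio(histogram_basis(3), [1.0] + [0.0] * 7))   # compare with r_0 = 1
    print(sup_ratio(trigonometric_basis(5), [1.0] * 5))       # compare with r_0 = sqrt(2)
```

The ratio printed for a given function $t$ is only a lower bound on $\phi_m$, since $\phi_m$ is a supremum over the whole space $S_m$.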



3 Estimation of the stationary density 



3.1 Decomposition of the risk for the projection estimator

Let

$$\gamma_n(t) = \frac{1}{n} \sum_{i=1}^n \big[\|t\|^2 - 2t(X_i)\big]. \qquad (2)$$

Notice that $E(\gamma_n(t)) = \|t - f\|^2 - \|f\|^2$, and therefore $\gamma_n(t)$ is the empirical
version of the $L^2$ distance between $t$ and $f$. Thus, $\hat f_m$ is defined by

$$\hat f_m = \arg\min_{t \in S_m} \gamma_n(t) \qquad (3)$$

where $S_m$ is a subspace of $L^2$ which satisfies M2. Although this estimator
depends on $n$, no index $n$ is mentioned in order to simplify the notation. This
is also the case for all the estimators in this paper.

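A more explicit formula for $\hat f_m$ is easy to derive:

$$\hat f_m = \sum_{\lambda \in \Lambda} \hat\beta_\lambda \varphi_\lambda, \qquad \hat\beta_\lambda = \frac{1}{n} \sum_{i=1}^n \varphi_\lambda(X_i) \qquad (4)$$

where $(\varphi_\lambda)_{\lambda \in \Lambda}$ is an orthonormal basis of $S_m$. Note that

$$E(\hat f_m) = \sum_{\lambda \in \Lambda} \langle f, \varphi_\lambda \rangle \varphi_\lambda,$$

which is the projection of $f$ on $S_m$.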
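The explicit formula (4) translates directly into code. The following minimal sketch uses the trigonometric basis of Section 2.4 and a toy sample of uniform random numbers. The sample is a hypothetical stand-in for the observed chain, and the function names are ours. It also evaluates the contrast $\gamma_n$ for a function given by its coefficients, which equals $-\sum_\lambda \hat\beta_\lambda^2$ at the minimizer $\hat f_m$.

```python
import math
import random


def trig(k, x):
    """k-th trigonometric basis function on [0, 1]: phi_0 = 1, then sin/cos pairs."""
    if not 0.0 <= x <= 1.0:
        return 0.0
    if k == 0:
        return 1.0
    j = (k + 1) // 2
    angle = 2.0 * math.pi * j * x
    return math.sqrt(2.0) * (math.sin(angle) if k % 2 == 1 else math.cos(angle))


def projection_coefficients(sample, dim):
    """beta_hat_k = (1/n) * sum_i phi_k(X_i), as in formula (4)."""
    n = len(sample)
    return [sum(trig(k, x) for x in sample) / n for k in range(dim)]


def contrast(coeffs, sample):
    """gamma_n(t) = ||t||^2 - (2/n) * sum_i t(X_i) for t = sum_k coeffs[k] * phi_k."""
    n = len(sample)
    fitted = sum(sum(c * trig(k, x) for k, c in enumerate(coeffs)) for x in sample)
    return sum(c * c for c in coeffs) - 2.0 * fitted / n


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.random() for _ in range(200)]   # toy data, not a Markov chain
    beta = projection_coefficients(sample, 5)
    print(beta)
    print(contrast(beta, sample), -sum(b * b for b in beta))
```

Any other coefficient vector gives a larger contrast, since $\gamma_n(t) = \|c - \hat\beta\|^2 - \|\hat\beta\|^2$ when $t$ has coefficients $c$ in the orthonormal basis.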

In order to evaluate the quality of this estimator, we now compute the mean
integrated squared error $E\|f - \hat f_m\|^2$ (often denoted by MISE).

Proposition 1 Let $X_n$ be a Markov chain which satisfies Assumptions A1-A5
and $S_m$ be a subspace of $L^2$ with dimension $D_m \leq n$. If $S_m$ satisfies condition
M2, then the estimator $\hat f_m$ defined by (3) satisfies

$$E\|f - \hat f_m\|^2 \leq d^2(f, S_m) + C\,\frac{D_m}{n}$$

where $C$ is a constant which does not depend on $n$.

To compute the bias term $d(f, S_m)$, we assume that $f$ belongs to the Besov
space $B^\alpha_{2,\infty}([0,1])$. We refer to DeVore and Lorentz (1993) p. 54 for the
definition of $B^\alpha_{2,\infty}([0,1])$. Notice that when $\alpha$ is an integer, the Besov space
$B^\alpha_{2,\infty}([0,1])$ contains the Sobolev space $W_2^\alpha$ (see DeVore and Lorentz (1993)
p. 51-55).

Hence, we have the following corollary.



Corollary 2 Let $X_n$ be a Markov chain which satisfies Assumptions A1-A5.
Assume that the stationary density $f$ belongs to $B^\alpha_{2,\infty}([0,1])$ and that $S_m$ is one
of the spaces mentioned in Section 2.4 (with the regularity of polynomials and
wavelets larger than $\alpha - 1$). If we choose $D_m = \lfloor n^{\frac{1}{2\alpha+1}} \rfloor$, then the estimator
defined by (3) satisfies

$$E\|f - \hat f_m\|^2 = O\big(n^{-\frac{2\alpha}{2\alpha+1}}\big).$$
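The choice of $D_m$ in this corollary can be motivated by a short heuristic computation. The sketch below assumes the usual approximation bound $d^2(f, S_m) \leq C' D_m^{-2\alpha}$ for $f$ in $B^\alpha_{2,\infty}([0,1])$. This is a standard property of the bases of Section 2.4 that is not restated in this section, and the constant $C'$ is our notation. Combined with Proposition 1, it gives:

```latex
E\|f - \hat f_m\|^2 \leq C' D_m^{-2\alpha} + C\,\frac{D_m}{n},
\qquad
D_m^{-2\alpha} \asymp \frac{D_m}{n}
\iff D_m \asymp n^{\frac{1}{2\alpha+1}},
\qquad\text{so that}\qquad
E\|f - \hat f_m\|^2 = O\big(n^{-\frac{2\alpha}{2\alpha+1}}\big).
```

The squared bias decreases with $D_m$ while the variance term increases, and the stated dimension is the one for which both terms have the same order.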



We can notice that we obtain the same rate as in the i.i.d. case (see
Donoho et al. (1996)). Actually, Clémençon (1999) proves that $n^{-\frac{2\alpha}{2\alpha+1}}$ is the
optimal rate in the minimax sense in the Markovian framework. With very
different theoretical tools, Tribouley and Viennet (1998) show that this rate
is also reached in the case of the univariate density estimation of $\beta$-mixing
random variables by using a wavelet estimator.

However, the choice $D_m = \lfloor n^{\frac{1}{2\alpha+1}} \rfloor$ is possible only if we know the regularity
$\alpha$ of the unknown $f$, which is generally not the case. This is the reason why we
construct an adaptive estimator, i.e. an estimator which achieves the optimal
rate without requiring the knowledge of $\alpha$.



3.2 Adaptive estimation 



Let $(S_m)_{m \in \mathcal{M}_n}$ be a collection of models as described in Section 2.3. For each
$S_m$, $\hat f_m$ is defined as above by (3). Next, we choose $\hat m$ among the family $\mathcal{M}_n$
such that

$$\hat m = \arg\min_{m \in \mathcal{M}_n} \big[\gamma_n(\hat f_m) + \mathrm{pen}(m)\big]$$

where pen is a penalty function to be specified later. We denote $\tilde f = \hat f_{\hat m}$ and
we bound the $L^2$-risk $E\|f - \tilde f\|^2$ as follows.

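Theorem 3 Let $X_n$ be a Markov chain which satisfies Assumptions A1-A5
and $(S_m)_{m \in \mathcal{M}_n}$ be a collection of models satisfying Assumptions M1-M3. Then
the estimator defined by

$$\tilde f = \hat f_{\hat m} \quad \text{where} \quad \hat m = \arg\min_{m \in \mathcal{M}_n} \big[\gamma_n(\hat f_m) + \mathrm{pen}(m)\big], \qquad (5)$$

with

$$\mathrm{pen}(m) = K\,\frac{D_m}{n} \quad \text{for some } K \geq K_0 \qquad (6)$$

(where $K_0$ is a constant depending on the chain) satisfies

$$E\|f - \tilde f\|^2 \leq 3 \inf_{m \in \mathcal{M}_n} \big\{d^2(f, S_m) + \mathrm{pen}(m)\big\} + \frac{C_1}{n}$$

where $C_1$ does not depend on $n$.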
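The selection rule (5)-(6) can be sketched in a few lines for the nested trigonometric models, for which $D_m = m$ and $\gamma_n(\hat f_m) = -\sum_{k < m} \hat\beta_k^2$. This is only an illustration under our own assumptions: the default penalty constant 5 is the hand-tuned value reported in Section 5, not the theoretical $K_0$, and the function names are hypothetical.

```python
import math


def trig(k, x):
    """k-th trigonometric basis function on [0, 1]: phi_0 = 1, then sin/cos pairs."""
    if not 0.0 <= x <= 1.0:
        return 0.0
    if k == 0:
        return 1.0
    j = (k + 1) // 2
    angle = 2.0 * math.pi * j * x
    return math.sqrt(2.0) * (math.sin(angle) if k % 2 == 1 else math.cos(angle))


def select_model(sample, penalty_constant=5.0):
    """Return (selected dimension, coefficients) minimizing -sum(beta^2) + K * D_m / n.

    Dimensions range over 1, ..., floor(sqrt(n)), following assumption M1.
    """
    n = len(sample)
    max_dim = max(1, math.isqrt(n))
    # The models are nested, so the coefficients are computed once for the largest one.
    beta = [sum(trig(k, x) for x in sample) / n for k in range(max_dim)]
    best_dim, best_value, energy = 1, math.inf, 0.0
    for dim in range(1, max_dim + 1):
        energy += beta[dim - 1] ** 2
        value = -energy + penalty_constant * dim / n   # gamma_n(f_hat_m) + pen(m)
        if value < best_value:
            best_dim, best_value = dim, value
    return best_dim, beta[:best_dim]


def evaluate(coeffs, x):
    """Value at x of the selected estimator sum_k coeffs[k] * phi_k."""
    return sum(c * trig(k, x) for k, c in enumerate(coeffs))


if __name__ == "__main__":
    flat = [(i + 0.5) / 100 for i in range(100)]   # evenly spread toy sample
    dim, coeffs = select_model(flat)
    print(dim, evaluate(coeffs, 0.3))
```

With an evenly spread sample the penalty rejects every oscillating component and the constant model is kept, whereas a sample concentrated at a single point pushes the selection towards the largest available model.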

Remark 4 The constant $K_0$ in the penalty depends only on the distribution
of the chain and can be chosen equal to $\max(r_0^2, 1)(C_1 + C_2\|f\|_\infty)$, where $C_1$
and $C_2$ are theoretical constants provided by the Nummelin splitting technique.
The number $r_0$ is known and depends on the chosen basis (see Subsection 2.3).
The mention of $\|f\|_\infty$ in the penalty term seems to be a problem, seeing that $f$
is unknown. Actually, we could replace $\|f\|_\infty$ by $\|\hat f\|_\infty$ with $\hat f$ an estimator of
$f$. This method of random penalty is successfully applied in Birgé and Massart
(1997) or Comte (2001), for example. But we choose not to use this method
here, since the constants $C_1$ and $C_2$ in $K_0$ are not computable either. Notice
that Clémençon (2000) deals with the same kind of unknown quantities in the
threshold of his nonlinear wavelet estimator. Actually, it is the price to pay for
dealing with dependent variables (see also the mixing constant in the threshold
in Tribouley and Viennet (1998)). But this annoyance can be circumvented for
practical purposes. Indeed, for the simulations the computation of the penalty is
hand-adjusted. Some techniques of calibration can be found in Lebarbier (2005)
in the context of multiple change point detection. In a Gaussian framework, the
practical choice of the penalty for implementation is also discussed in Section
4 of Birgé and Massart (2007).



Corollary 5 Let $X_n$ be a Markov chain which satisfies Assumptions A1-A5
and $(S_m)_{m \in \mathcal{M}_n}$ be a collection of models mentioned in Section 2.4 (with the
regularity of polynomials and wavelets larger than $\alpha - 1$). If $f$ belongs to
$B^\alpha_{2,\infty}([0,1])$, with $\alpha > 1/2$, then the estimator defined by (5) and (6) satisfies

$$E\|f - \tilde f\|^2 = O\big(n^{-\frac{2\alpha}{2\alpha+1}}\big).$$

Remark 6 When $\alpha > \frac{1}{2}$, $B^\alpha_{2,\infty}([0,1]) \subset C[0,1]$ (where $C[0,1]$ is the set of
the continuous functions with support in $[0,1]$) and then the assumption A3
$\|f\|_\infty < \infty$ is superfluous.



We have already noticed that it is the optimal rate in the minimax sense (see
the lower bound in Clémençon (1999)). Note that here the procedure reaches
this rate whatever the regularity of $f$, without needing to know $\alpha$. This result is
thus an improvement on that of Clémençon (1999), whose adaptive procedure
achieves only the rate $(\log(n)/n)^{\frac{2\alpha}{2\alpha+1}}$. Moreover, our procedure allows the use of
more bases (not only wavelets) and is easy to implement.
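To get a feeling for the size of the logarithmic loss, the two rates can be compared numerically. The snippet below is a small illustration only: the regularity `alpha = 2.0` and the sample sizes are hypothetical values chosen for the example.

```python
import math


def optimal_rate(n, alpha):
    """Minimax rate n**(-2*alpha / (2*alpha + 1)) for regularity alpha."""
    return n ** (-2.0 * alpha / (2.0 * alpha + 1.0))


def log_loss_rate(n, alpha):
    """Rate (log(n) / n)**(2*alpha / (2*alpha + 1)), i.e. with a logarithmic loss."""
    return (math.log(n) / n) ** (2.0 * alpha / (2.0 * alpha + 1.0))


if __name__ == "__main__":
    alpha = 2.0   # hypothetical regularity, chosen only for illustration
    for n in (100, 1000, 10000):
        ratio = log_loss_rate(n, alpha) / optimal_rate(n, alpha)
        print(n, optimal_rate(n, alpha), log_loss_rate(n, alpha), ratio)
```

The ratio of the two rates is $(\log n)^{2\alpha/(2\alpha+1)}$, which grows slowly but without bound as $n$ increases.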



4 Estimation of the transition density 

We now suppose that the transition kernel $P$ has a density $\pi$. In order to
estimate $\pi$, we remark that $\pi$ can be written $g/f$, where $g$ is the density of
$(X_i, X_{i+1})$. Thus we begin with the estimation of $g$. As previously, $g$ and $\pi$ are
estimated on a compact set which is assumed to be equal to $[0,1]^2$, without
loss of generality.



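4.1 Estimation of the joint density g

We now need a new assumption.

A3'. $\pi$ belongs to $L^\infty([0,1]^2)$.

Notice that A3' implies A3. We now consider the following subspaces:

$$S_m^{(2)} = \Big\{t \in L^2([0,1]^2),\ t(x,y) = \sum_{\lambda, \mu \in \Lambda_m} a_{\lambda,\mu}\,\varphi_\lambda(x)\varphi_\mu(y)\Big\}$$

where $(\varphi_\lambda)_{\lambda \in \Lambda_m}$ is an orthonormal basis of $S_m$. Notice that, if we set

$$\phi_m^{(2)} = \frac{1}{D_m} \sup_{t \in S_m^{(2)} \setminus \{0\}} \frac{\|t\|_\infty}{\|t\|},$$

hypothesis M2 implies that $\phi_m^{(2)}$ is bounded by $r_0^2$. The condition M1 must be
replaced by the following condition:

M1'. Each $S_m^{(2)}$ is a linear subspace of $(L^\infty \cap L^2)([0,1]^2)$ with dimension $D_m^2 \leq \sqrt{n}$.

Let now

$$\gamma_n^{(2)}(t) = \frac{1}{n-1} \sum_{i=1}^{n-1} \big\{\|t\|^2 - 2t(X_i, X_{i+1})\big\}.$$

We define as above

$$\hat g_m = \arg\min_{t \in S_m^{(2)}} \gamma_n^{(2)}(t)$$

and $\hat m^{(2)} = \arg\min_{m \in \mathcal{M}_n} [\gamma_n^{(2)}(\hat g_m) + \mathrm{pen}^{(2)}(m)]$, where $\mathrm{pen}^{(2)}(m)$ is a penalty function
which will be specified later. Lastly, we set $\tilde g = \hat g_{\hat m^{(2)}}$.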
Theorem 7 Let $X_n$ be a Markov chain which satisfies Assumptions A1-A2-A3'-A4-A5
and $(S_m)_{m \in \mathcal{M}_n}$ be a collection of models satisfying Assumptions
M1'-M2-M3. Then the estimator defined by

$$\tilde g = \hat g_{\hat m^{(2)}} \quad \text{where} \quad \hat m^{(2)} = \arg\min_{m \in \mathcal{M}_n} \big[\gamma_n^{(2)}(\hat g_m) + \mathrm{pen}^{(2)}(m)\big], \qquad (7)$$

with

$$\mathrm{pen}^{(2)}(m) = K^{(2)}\,\frac{D_m^2}{n} \quad \text{for some } K^{(2)} \geq K_0^{(2)}$$

(where $K_0^{(2)}$ is a constant depending on the chain) satisfies

$$E\|\tilde g - g\|^2 \leq 3 \inf_{m \in \mathcal{M}_n} \big\{d^2(g, S_m^{(2)}) + \mathrm{pen}^{(2)}(m)\big\} + \frac{C_1}{n}$$

where $C_1$ does not depend on $n$.

The constant $K_0^{(2)}$ in the penalty is similar to the constant $K_0$ in Theorem 3
(replacing $r_0$ by $r_0^2$, and $\|f\|_\infty$ by $\|g\|_\infty$). We refer the reader to Remark 4 for
considerations related to these constants.

Corollary 8 Let $X_n$ be a Markov chain which satisfies Assumptions A1-A2-A3'-A4-A5
and $(S_m)_{m \in \mathcal{M}_n}$ be a collection of models mentioned in Section 2.4
(with the regularity of polynomials and wavelets larger than $\alpha - 1$). If $g$ belongs
to $B^\alpha_{2,\infty}([0,1]^2)$, with $\alpha > 1$, then

$$E\|\tilde g - g\|^2 = O\big(n^{-\frac{2\alpha}{2\alpha+2}}\big).$$

This rate of convergence is the minimax rate for density estimation in
dimension 2 in the case of i.i.d. random variables (see for instance Ibragimov and
Has'minskii (1980)). Let us now proceed to the estimation of the transition density.

4.2 Estimation of $\pi$



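The estimator of $\pi$ is defined in the following way. Let

$$\tilde\pi(x,y) = \begin{cases} \dfrac{\tilde g(x,y)}{\tilde f(x)} & \text{if } |\tilde g(x,y)| \leq a_n |\tilde f(x)|, \\ 0 & \text{else,} \end{cases}$$

with $a_n = n^\beta$ and $\beta < 1/8$.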
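The truncated quotient is straightforward to implement pointwise. In the sketch below, the default exponent `beta=0.1` mirrors the threshold $n^{1/10}$ used in the simulations of Section 5, and returning 0 when the density estimate vanishes is our own convention for the degenerate case, not something stated in the paper.

```python
def quotient_estimate(g_value, f_value, n, beta=0.1):
    """Truncated quotient: g / f when |g| <= n**beta * |f|, and 0 otherwise.

    g_value and f_value are the values of the estimators of g and f at a point (x, y).
    beta should be smaller than 1/8 to match the theoretical requirement.
    """
    threshold = n ** beta
    # f_value == 0 is treated as the "else" branch to avoid a division by zero
    # (a convention chosen for this sketch).
    if f_value == 0.0 or abs(g_value) > threshold * abs(f_value):
        return 0.0
    return g_value / f_value


if __name__ == "__main__":
    print(quotient_estimate(1.5, 1.0, 1000))   # ratio below the threshold: kept
    print(quotient_estimate(2.0, 1.0, 1000))   # ratio above the threshold: set to 0
```

The truncation keeps the quotient bounded by $a_n$, which protects the estimator against small values of the estimated stationary density in the denominator.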

We introduce a new assumption:

A6. There exists a positive constant $\chi$ such that $\forall x \in [0,1]$, $f(x) \geq \chi$.



Theorem 9 Let $X_n$ be a Markov chain which satisfies Assumptions A1-A2-A3'-A4-A5-A6
and $(S_m)_{m \in \mathcal{M}_n}$ be a collection of models mentioned in Section
2.4 (with the regularity of polynomials and wavelets larger than $\alpha - 1$). We
suppose that the dimension $D_m$ of the models is such that

$$\forall m \in \mathcal{M}_n, \quad \ln n \leq D_m \leq n^{1/4}.$$

If $f$ belongs to $B^\alpha_{2,\infty}([0,1])$, with $\alpha > 1/2$, then for $n$ large enough

• there exist $C_1$ and $C_2$ such that

$$E\|\pi - \tilde\pi\|^2 \leq C_1 E\|g - \tilde g\|^2 + C_2 E\|f - \tilde f\|^2 + o\Big(\frac{1}{n}\Big)$$

• if furthermore $g$ belongs to $B^\beta_{2,\infty}([0,1]^2)$ (with $\beta > 1$), then

$$E\|\pi - \tilde\pi\|^2 = O\Big(\sup\big(n^{-\frac{2\beta}{2\beta+2}},\ n^{-\frac{2\alpha}{2\alpha+1}}\big)\Big).$$

Clémençon (2000) proved that $n^{-2\beta/(2\beta+2)}$ is the minimax rate for $f$ and $g$
of same regularity $\beta$. Notice that in this case the procedure is adaptive and
there is no logarithmic loss in the estimation rate, contrary to the result of
Clémençon (2000).



But it should be remembered that we consider only the restriction of $f$ or
$\pi$, since the observations are in a compact set. And the restriction of the
stationary density to $[0,1]$ may be less regular than the restriction of the
transition density. The previous procedure thus has the disadvantage that the
resulting rate does not depend only on the regularity of $\pi$ but also on that
of $f$.

However, if the chain lives on $[0,1]$ and if $g$ belongs to $B^\beta_{2,\infty}([0,1]^2)$ (that is
to say that we consider the regularity of $g$ on its whole support and not only
on the compact of the observations), then the equality $f(y) = \int g(x,y)\,dx$ yields
that $f$ belongs to $B^\beta_{2,\infty}([0,1])$ and then $E\|\pi - \tilde\pi\|^2 = O(n^{-\frac{2\beta}{2\beta+2}})$. Moreover, if $\pi$
belongs to $B^\beta_{2,\infty}([0,1]^2)$, the formula $f(y) = \int f(x)\pi(x,y)\,dx$ implies that $f$ belongs
to $B^\beta_{2,\infty}([0,1])$. Then, by using properties of Besov spaces (see Runst and Sickel
(1996) p. 192), $g = f\pi$ belongs to $B^\beta_{2,\infty}([0,1]^2)$. So in this case of a chain
with compact support, the minimax rate is achieved as soon as $\pi$ belongs to
$B^\beta_{2,\infty}([0,1]^2)$ with $\beta > 1$.



5 Simulations 



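The computation of the previous estimator is very simple. We use the following
procedure in 3 steps:

First step:

• For each $m$, compute $\gamma_n(\hat f_m) + \mathrm{pen}(m)$. Notice that $\gamma_n(\hat f_m) = -\sum_{\lambda \in \Lambda_m} \hat\beta_\lambda^2$,
where $\hat\beta_\lambda$ is defined by (4) and is quickly computed.

• Select the argmin $\hat m$ of $\gamma_n(\hat f_m) + \mathrm{pen}(m)$.

• Choose $\tilde f = \sum_{\lambda \in \Lambda_{\hat m}} \hat\beta_\lambda \varphi_\lambda$.

Second step:

• For each $m$ such that $D_m^2 \leq \sqrt{n}$, compute $\gamma_n^{(2)}(\hat g_m) + \mathrm{pen}^{(2)}(m)$, with
$\gamma_n^{(2)}(\hat g_m) = -\sum_{\lambda,\mu \in \Lambda_m} \hat a_{\lambda,\mu}^2$, where $\hat a_{\lambda,\mu} = \frac{1}{n-1} \sum_{i=1}^{n-1} \varphi_\lambda(X_i)\varphi_\mu(X_{i+1})$.

• Select the argmin $\hat m^{(2)}$ of $\gamma_n^{(2)}(\hat g_m) + \mathrm{pen}^{(2)}(m)$.

• Choose $\tilde g(x,y) = \sum_{\lambda,\mu \in \Lambda_{\hat m^{(2)}}} \hat a_{\lambda,\mu}\,\varphi_\lambda(x)\varphi_\mu(y)$.

Third step: Compute $\tilde\pi(x,y) = \tilde g(x,y)/\tilde f(x)$ if $|\tilde g(x,y)| \leq n^{1/10}|\tilde f(x)|$ and 0
otherwise.

The bases are here adjusted with an affine transform in order to be defined on
the estimation interval $[c,d]$ instead of $[0,1]$. We consider 2 different bases (see
Section 2.4): trigonometric basis and histogram basis.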

We found that a good choice for the penalty functions is $\mathrm{pen}(m) = 5D_m/n$
and $\mathrm{pen}^{(2)}(m) = 0.02\,D_m^2/n$.

We consider several kinds of Markov chains:

• An autoregressive process denoted by AR and defined by:

$$X_{n+1} = aX_n + b + \varepsilon_{n+1}$$

where the $\varepsilon_{n+1}$ are independent and identically distributed random variables,
with centered Gaussian distribution with variance $\sigma^2$. For this process, the
stationary distribution is a Gaussian with mean $b/(1-a)$ and variance
$\sigma^2/(1-a^2)$. Denoting by $\varphi(z) = 1/(\sigma\sqrt{2\pi}) \exp(-z^2/2\sigma^2)$ the Gaussian
density, the transition density can be written $\pi(x,y) = \varphi(y - ax - b)$. We
consider the following parameter values:

(i) $a = 2/3$, $b = 0$, $\sigma^2 = 5/9$, estimated on $[-2,2]^2$. The stationary density of
this chain is the standard Gaussian distribution.

(ii) $a = 0.5$, $b = 3$, $\sigma^2 = 1$, and then the process is estimated on $[4,8]^2$.

• A radial Ornstein-Uhlenbeck process (in its discrete version). For $j = 1, \dots, \delta$,
we define the processes $\xi^j_{n+1} = a\xi^j_n + \beta\varepsilon^j_{n+1}$, where the $\varepsilon^j_n$ are i.i.d. standard
Gaussian. The chain is then defined by $X_n = \sqrt{\sum_{j=1}^\delta (\xi^j_n)^2}$. The transition
density is given in Chaleyat-Maurel and Genon-Catalot (2006), where this
process is studied in detail:

$$\pi(x,y) = \mathbb{1}_{y>0} \exp\Big(-\frac{y^2 + a^2x^2}{2\beta^2}\Big)\, I_{\delta/2-1}\Big(\frac{axy}{\beta^2}\Big)\, \frac{ax}{\beta^2} \Big(\frac{y}{ax}\Big)^{\delta/2}$$

and $I_{\delta/2-1}$ is the Bessel function with index $\delta/2 - 1$. The invariant density
is $f(x) = C\,\mathbb{1}_{x>0} \exp(-x^2/2\rho^2)\,x^{\delta-1}$ with $\rho^2 = \beta^2/(1-a^2)$ and $C$ such that
$\int f = 1$. This process (with here $a = 0.5$, $\beta = 3$, $\delta = 3$) is denoted by $\sqrt{\mathrm{CIR}}$
since its square is actually a Cox-Ingersoll-Ross process. The estimation
domain for this process is $[2,10]^2$.

• A Cox-Ingersoll-Ross process, which is exactly the square of the previous
process. Its invariant distribution is a Gamma density with scale
parameter $l = 1/2\rho^2$ and shape parameter $a = \delta/2$. The transition density
is

$$\pi(x,y) = \frac{1}{2\beta^2} \exp\Big(-\frac{y + a^2x}{2\beta^2}\Big)\, I_{\delta/2-1}\Big(\frac{a\sqrt{xy}}{\beta^2}\Big) \Big(\frac{y}{a^2x}\Big)^{\delta/4-1/2}.$$

The parameters used are the following:

(iii) $a = 3/4$, $\beta = \sqrt{7/48}$ (so that $l = 3/2$) and $\delta = 4$, estimated on $[0.1,3]^2$.

(iv) $a = 1/3$, $\beta = 3/4$ and $\delta = 2$. This chain is estimated on $[0,2]^2$.

• An ARCH process defined by $X_{n+1} = \sin(X_n) + (\cos(X_n) + 3)\varepsilon_{n+1}$, where
the $\varepsilon_{n+1}$ are i.i.d. standard Gaussian. The transition density of this chain is

$$\pi(x,y) = \varphi\Big(\frac{y - \sin(x)}{\cos(x) + 3}\Big) \frac{1}{\cos(x) + 3}$$

and we estimate this process on $[-5,5]^2$.
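The parameter values listed above can be checked with exact rational arithmetic. The short sketch below recomputes the stationary mean and variance of the two AR chains and the Gamma parameters of CIR(iii) from the formulas given in the text. The function names are ours, and $\beta^2 = 7/48$ is passed directly to avoid taking a square root.

```python
from fractions import Fraction


def ar_stationary_moments(a, b, sigma2):
    """Mean b / (1 - a) and variance sigma^2 / (1 - a^2) of the stationary AR law."""
    return b / (1 - a), sigma2 / (1 - a * a)


def cir_gamma_parameters(a, beta2, delta):
    """Gamma parameters (l, shape) = (1 / (2 rho^2), delta / 2), rho^2 = beta^2 / (1 - a^2)."""
    rho2 = beta2 / (1 - a * a)
    return 1 / (2 * rho2), Fraction(delta, 2)


if __name__ == "__main__":
    print(ar_stationary_moments(Fraction(2, 3), Fraction(0), Fraction(5, 9)))   # AR(i)
    print(ar_stationary_moments(Fraction(1, 2), Fraction(3), Fraction(1)))      # AR(ii)
    print(cir_gamma_parameters(Fraction(3, 4), Fraction(7, 48), 4))             # CIR(iii)
```

For AR(i) this gives mean 0 and variance 1, in line with the standard Gaussian stationary law, and for AR(ii) the stationary mean is 6, the centre of the estimation domain $[4,8]$.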

For this last chain, the stationary density is not explicit. So we simulate $n + 500$
variables and we estimate only from the last $n$ to ensure the stationarity of
the process. For the other chains, it is sufficient to simulate an initial variable
$X_0$ with density $f$.
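The two simulation strategies described here can be sketched as follows: a stationary start for the AR chain, and a burn-in of 500 steps for the ARCH chain. This is a minimal illustration and not the code used for the tables. The function names and the starting point 0 of the ARCH chain are our own choices.

```python
import math
import random


def simulate_ar(n, a, b, sigma2, rng):
    """AR chain X_{k+1} = a X_k + b + eps_{k+1}, started from its stationary Gaussian law."""
    x = rng.gauss(b / (1.0 - a), math.sqrt(sigma2 / (1.0 - a * a)))
    chain = [x]
    for _ in range(n - 1):
        x = a * x + b + rng.gauss(0.0, math.sqrt(sigma2))
        chain.append(x)
    return chain


def simulate_arch(n, rng, burn_in=500):
    """ARCH chain X_{k+1} = sin(X_k) + (cos(X_k) + 3) eps_{k+1}.

    Simulates n + burn_in values and keeps only the last n of them.
    """
    x = 0.0   # arbitrary starting point (our choice), forgotten after the burn-in
    chain = []
    for _ in range(n + burn_in):
        x = math.sin(x) + (math.cos(x) + 3.0) * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain[burn_in:]


def arch_transition_density(x, y):
    """pi(x, y) = phi((y - sin x) / (cos x + 3)) / (cos x + 3), phi standard Gaussian density."""
    scale = math.cos(x) + 3.0
    z = (y - math.sin(x)) / scale
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * scale)


if __name__ == "__main__":
    rng = random.Random(42)
    print(simulate_ar(5, 2.0 / 3.0, 0.0, 5.0 / 9.0, rng))   # parameters of AR(i)
    print(simulate_arch(5, rng))
```

Passing an explicit `random.Random` instance makes each simulated path reproducible, which is convenient when averaging a risk over many samples.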

Figure 1 illustrates the performance of the method and Table 1 shows the
$L^2$-risk for different values of $n$.

The results in Table 1 are roughly good and illustrate that we cannot claim
that one basis gives better results than the others. We can then imagine
a mixed strategy, i.e. a procedure which uses several kinds of bases and
which can choose the best basis or, for instance, the best degree for a
polynomial basis. These techniques are successfully used in regression frameworks
by Comte and Rozenholc (2002, 2004).

Fig. 1. Estimator (light surface) and true transition (dark surface) for the process
CIR(iii) estimated with a trigonometric basis, n=1000

n          50      100     250     500     1000    basis

AR(i)      0.7280  0.5442  0.2773  0.1868  0.1767  H
           0.5262  0.4682  0.2223  0.1797  0.1478  T
AR(ii)     0.4798  0.3252  0.2249  0.1160  0.0842  H
           0.2867  0.2393  0.1770  0.1342  0.1083  T
√CIR       0.3054  0.2324  0.1724  0.1523  0.1278  H
           0.2157  0.1939  0.1450  0.1284  0.0815  T
CIR(iii)   0.5086  0.3082  0.2113  0.1760  0.1477  H
           0.4170  0.3959  0.2843  0.2565  0.2265  T
CIR(iv)    0.3381  0.2101  0.1205  0.0756  0.0458  H
           0.2273  0.2212  0.1715  0.1338  0.1328  T
ARCH       0.3170  0.3013  0.2420  0.2124  0.1610  H
           0.2553  0.2541  0.2075  0.1884  0.1689  T

Table 1
MISE $E\|\pi - \tilde\pi\|^2$ averaged over N = 200 samples. H: histogram basis, T: trigonometric
basis.

The results for the stationary density are given in Table 2.



n          50      100     250     500     1000    basis

AR(i)      0.0658  0.0599  0.0329  0.0137  0.0122  H
           0.0569  0.0538  0.0246  0.0040  0.0026  T
AR(ii)     0.0388  0.0354  0.0309  0.0147  0.0081  H
           0.0342  0.0342  0.0327  0.0195  0.0054  T
√CIR       0.0127  0.0115  0.0105  0.0102  0.0096  H
           0.0169  0.0169  0.0168  0.0166  0.0107  T
CIR(iii)   0.0335  0.0268  0.0229  0.0222  0.0210  H
           0.0630  0.0385  0.0216  0.0211  0.0191  T
CIR(iv)    0.0317  0.0249  0.0223  0.0185  0.0103  H
           0.0873  0.0734  0.0572  0.0522  0.0458  T

Table 2
MISE $E\|f - \tilde f\|^2$ averaged over N = 200 samples. H: histogram basis, T: trigonometric
basis.



We can compare the results of Table 2 with those of Dalelane (2005), who gives
results of simulations for i.i.d. random variables. For density estimation, she
uses three types of kernel: the Gauss kernel, the sinc kernel (where $\mathrm{sinc}(x) = \sin(x)/x$)
and her cross-validation optimal kernel (denoted by Dal). Table 3 gives her
results for the Gaussian density and the Gamma distribution with the same
parameters that we used (2 and 3/2). If we compare the results that she
obtains with her optimal kernel and our results with the trigonometric basis,
we observe that her risks are about 5 times smaller than ours. However, this kernel
is particularly effective, and if we consider the classical kernels, we notice that
the results are almost comparable, with a reasonable price for dependency.



6 Proofs 



6.1 The Nummelin splitting technique 



This whole subsection is summarized from Höpfner and Löcherbach (2003),
p. 60-63, and is detailed for the sake of completeness.

The interest of the Nummelin splitting technique is to create a two-dimensional
chain (the "split chain"), which automatically contains an atom. Let us recall
the definition of an atom. Let A be a set such that $\psi(A) > 0$ where $\psi$ is an
irreducibility measure. The set A is called an atom for the chain $(X_n)$ with
transition kernel P if there exists a measure $\nu$ such that $P(x, B) = \nu(B)$, for
all x in A and for every event B.

density          n = 100       n = 500       n = 1000      kernel

Gaussian         (illegible)   (illegible)   (illegible)   Dal
(=AR(i))         (illegible)   (illegible)   (illegible)   Gauss
                 0.0114        0.0026        0.0010        sinc

Gamma            0.0148        0.0052        0.0027        Dal
(=CIR(iii))      0.0209        0.0061        0.0031        Gauss
                 0.0403        0.0166        0.0037        sinc

Table 3

MISE obtained by Dalelane (2005) for i.i.d. data, averaged over 50 samples

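Let us now describe the splitting method. Let E = [0, 1] be the state space and
$\mathcal{E}$ the associated $\sigma$-field. Each point x in E is split into
$x_0 = (x, 0) \in E_0 = E \times \{0\}$ and $x_1 = (x, 1) \in E_1 = E \times \{1\}$.
Each set A in $\mathcal{E}$ is split into $A_0 = A \times \{0\}$ and $A_1 = A \times \{1\}$.
Thus, we have defined a new probability space $(E^*, \mathcal{E}^*)$ where
$E^* := E_0 \cup E_1$ and $\mathcal{E}^* = \sigma(A_0, A_1 : A \in \mathcal{E})$. Using h
defined in A4, a measure $\lambda$ on $(E, \mathcal{E})$ splits according to

$$\lambda^*(A_1) = \int 1_A(x)\, h(x)\, \lambda(dx), \qquad
\lambda^*(A_0) = \int 1_A(x)\, (1 - h)(x)\, \lambda(dx).$$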

Notice that $\lambda^*(A_0 \cup A_1) = \lambda(A)$. Now the aim is to define a new transition
probability $P^*(., .)$ on $(E^*, \mathcal{E}^*)$ to replace the transition kernel P of $(X_n)$. Let

$$P^*(x_i, .) = \begin{cases} \dfrac{1}{1 - h(x)} (P - h \otimes \nu)^*(x, .) & \text{if } i = 0 \text{ and } h(x) < 1, \\ \nu^* & \text{else,} \end{cases}$$

where $\nu$ is the measure introduced in A4 and $h \otimes \nu$ is a kernel defined by
$h \otimes \nu(x, dy) = h(x)\nu(dy)$. Consider now a chain $(X_n^*)$ on $(E^*, \mathcal{E}^*)$ with
one-step transition $P^*$ and with starting law $\mu^*$. The split chain $(X_n^*)$ has the
following properties:

P1. For all $(A_p)_{0 \le p \le N} \in \mathcal{E}^N$ and for every measure $\lambda$,

$$P_\lambda(X_p \in A_p,\ 0 \le p \le N) = P_{\lambda^*}(X_p^* \in A_p \times \{0, 1\},\ 0 \le p \le N).$$

P2. The split chain is irreducible positive recurrent with stationary distribution
$\mu^*$.

P3. The set $E_1$ is an atom for $(X_n^*)$.

We can also extend functions $g : E \to \mathbb{R}$ to $E^*$ via $g^*(x_0) = g(x) = g^*(x_1)$.
Then, the property P1 can be written: for every measurable function
$g : E^N \to \mathbb{R}$,

$$E_\lambda(g(X_1, .., X_N)) = E_{\lambda^*}(g^*(X_1^*, .., X_N^*)).$$

We can say that $(X_n)$ is a marginal chain of $(X_n^*)$. When necessary, the following
proofs are decomposed in two steps: first, we assume that the Markov chain
has an atom, next we extend the result to the general chain by introducing
the artificial atom $E_1$.
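
To make the splitting of a measure concrete, here is a minimal sketch on a finite subset of [0, 1]. It assumes a discrete measure stored as a dictionary and an arbitrary function `h` with values in [0, 1] standing in for the function of A4; the names `split_measure`, `lam` and `h` are illustrative and do not come from the paper.

```python
def split_measure(lam, h):
    """Split a discrete measure lam on E into lam* on E x {0, 1}.

    lam: dict mapping each point x to its mass lam({x}).
    h:   function with values in [0, 1] (hypothetical stand-in for the h of A4).
    """
    lam_star = {}
    for x, mass in lam.items():
        lam_star[(x, 1)] = h(x) * mass          # mass sent to E_1 = E x {1}
        lam_star[(x, 0)] = (1.0 - h(x)) * mass  # mass left on E_0 = E x {0}
    return lam_star


# Illustrative data: three points of [0, 1] and an arbitrary h.
lam = {0.1: 0.2, 0.5: 0.5, 0.9: 0.3}
h = lambda x: 0.5 if x < 0.6 else 0.0
lam_star = split_measure(lam, h)

# Check lam*(A_0 U A_1) = lam(A) for A = {0.1, 0.5}.
A = [0.1, 0.5]
mass_split = sum(lam_star[(x, i)] for x in A for i in (0, 1))
mass_original = sum(lam[x] for x in A)
```

The two masses agree up to rounding, which is the identity $\lambda^*(A_0 \cup A_1) = \lambda(A)$ noted above.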



6.2 Proof of Proposition 1

First step: We suppose that (X n ) has an atom A. 

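Let $f_m$ be the orthogonal projection of f on $S_m$. Pythagoras theorem gives us:

$$E\|f - \hat f_m\|^2 = d^2(f, S_m) + E\|f_m - \hat f_m\|^2.$$

We recognize in the right member a bias term and a variance term. According
to the expression (3) of $\hat f_m$ the variance term can be written:

$$E\|f_m - \hat f_m\|^2 = \sum_{\lambda \in \Lambda_m} \mathrm{Var}(\hat\beta_\lambda) = \sum_{\lambda \in \Lambda_m} E(\nu_n^2(\varphi_\lambda)) \tag{9}$$

where $\nu_n(t) = (1/n) \sum_{i=1}^n [t(X_i) - \langle t, f \rangle]$. By denoting
$\tau = \tau(1) = \inf\{n \ge 1, X_n \in A\}$ and $\tau(j) = \inf\{n > \tau(j-1), X_n \in A\}$
for $j \ge 2$, we can decompose $\nu_n(t)$ in the classic following way:

$$\nu_n(t) = \nu_n^{(1)}(t) + \nu_n^{(2)}(t) + \nu_n^{(3)}(t) + \nu_n^{(4)}(t) \tag{10}$$

with

$$\nu_n^{(1)}(t) = \nu_n(t) 1_{\tau > n},$$
$$\nu_n^{(2)}(t) = \frac{1}{n} \sum_{i=1}^{\tau} [t(X_i) - \langle t, f \rangle] 1_{\tau \le n},$$
$$\nu_n^{(3)}(t) = \frac{1}{n} \sum_{i=1+\tau(1)}^{\tau(l_n)} [t(X_i) - \langle t, f \rangle] 1_{\tau \le n},$$
$$\nu_n^{(4)}(t) = \frac{1}{n} \sum_{i=\tau(l_n)+1}^{n} [t(X_i) - \langle t, f \rangle] 1_{\tau \le n},$$

and $l_n = \sum_{k=1}^n 1_A(X_k)$ (number of visits to the atom A). Hence,

$$\nu_n(t)^2 \le 4\{\nu_n^{(1)}(t)^2 + \nu_n^{(2)}(t)^2 + \nu_n^{(3)}(t)^2 + \nu_n^{(4)}(t)^2\}.$$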
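
The decomposition (10) can be checked numerically on a simulated path. The sketch below is only an illustration under stated assumptions: it uses a hypothetical three-state chain (any single state of a finite chain is an atom), an arbitrary transition matrix `P`, and a placeholder value `mean_t` for $\langle t, f \rangle$, since the identity holds for any centering constant. The function names are not from the paper.

```python
import random


def decompose_nu(path, atom, t, mean_t):
    """Return nu_n(t) and its four pieces (nu1, nu2, nu3, nu4) as in (10).

    path:   observed values X_1, ..., X_n.
    atom:   set of states forming the atom A.
    mean_t: centering constant playing the role of <t, f>.
    """
    n = len(path)
    centered = [t(x) - mean_t for x in path]
    nu = sum(centered) / n
    # Times (1-based) at which the chain visits the atom: tau(1), tau(2), ...
    hits = [i for i, x in enumerate(path, start=1) if x in atom]
    if not hits:  # tau > n: only the first piece is non-zero
        return nu, (nu, 0.0, 0.0, 0.0)
    first, last = hits[0], hits[-1]      # tau(1) and tau(l_n)
    nu2 = sum(centered[:first]) / n      # i = 1, ..., tau
    nu3 = sum(centered[first:last]) / n  # i = tau(1)+1, ..., tau(l_n)
    nu4 = sum(centered[last:]) / n       # i = tau(l_n)+1, ..., n
    return nu, (0.0, nu2, nu3, nu4)


def simulate_chain(P, x0, n, rng):
    """Simulate n steps of a finite-state chain with transition matrix P."""
    states = list(range(len(P)))
    path, x = [], x0
    for _ in range(n):
        x = rng.choices(states, weights=P[x])[0]
        path.append(x)
    return path


rng = random.Random(0)
P = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]]  # illustrative matrix
path = simulate_chain(P, 0, 200, rng)
nu, parts = decompose_nu(path, atom={0}, t=float, mean_t=1.0)
```

Summing the four pieces recovers `nu` up to rounding, whatever the path.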
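• To bound $\nu_n^{(1)}(t)^2$, notice that $|\nu_n(t)| \le 2\|t\|_\infty$. And then, by using M2,
$|\nu_n^{(1)}(t)| \le 2 r_0 \sqrt{D_m}\, \|t\|\, 1_{\tau > n}$. Thus,

$$E(\nu_n^{(1)}(t)^2) \le 4 r_0^2 \|t\|^2 D_m P(\tau > n) \le 4 r_0^2 \|t\|^2 E(\tau^2) \frac{D_m}{n^2}.$$

• We bound the second term in the same way. Since $|\nu_n^{(2)}(t)| \le 2(\tau/n)\|t\|_\infty 1_{\tau \le n}$,
we obtain $|\nu_n^{(2)}(t)| \le 2\|t\| r_0 \tau \sqrt{D_m}/n$ and then

$$E(\nu_n^{(2)}(t)^2) \le 4 r_0^2 \|t\|^2 E(\tau^2) \frac{D_m}{n^2}.$$

• Let us study now the fourth term. As

$$|\nu_n^{(4)}(t)| \le 2\, \frac{n - \tau(l_n)}{n}\, \|t\|_\infty 1_{\tau \le n} \le 2 (n - \tau(l_n)) \frac{r_0 \sqrt{D_m}}{n} \|t\| 1_{\tau \le n},$$

we get $E(\nu_n^{(4)}(t)^2) \le 4 r_0^2 \|t\|^2 \frac{D_m}{n^2} E((n - \tau(l_n))^2 1_{\tau \le n})$.
It remains to bound $E((n - \tau(l_n))^2 1_{\tau \le n})$:

$$E_\mu((n - \tau(l_n))^2 1_{\tau \le n}) = \sum_{k=1}^n E_\mu((n - k)^2 1_{\tau(l_n) = k} 1_{\tau \le n})$$
$$= \sum_{k=1}^n (n - k)^2 P_\mu(X_{k+1} \notin A, .., X_n \notin A \mid X_k \in A) P_\mu(X_k \in A)$$
$$= \sum_{k=1}^n (n - k)^2 P_A(X_1 \notin A, .., X_{n-k} \notin A)\, \mu(A)$$

by using the stationarity of X and the Markov property. Hence

$$E_\mu((n - \tau(l_n))^2 1_{\tau \le n}) = \sum_{k=1}^n (n - k)^2 P_A(\tau > n - k)\, \mu(A).$$

By the Markov inequality, $P_A(\tau > n - k) \le E_A(\tau^4)/(n - k)^4$, and $\sum_{j \ge 1} j^{-2} \le 2$.
Therefore $E_\mu((n - \tau(l_n))^2 1_{\tau \le n}) \le 2 E_A(\tau^4) \mu(A)$. Finally

$$E(\nu_n^{(4)}(t)^2) \le 8 r_0^2 \|t\|^2 \mu(A) E_A(\tau^4) \frac{D_m}{n^2}$$

and we can summarize the last three results by

$$E(\nu_n^{(1)}(t)^2 + \nu_n^{(2)}(t)^2 + \nu_n^{(4)}(t)^2) \le 8 r_0^2 \|t\|^2 [E_\mu(\tau^2) + \mu(A) E_A(\tau^4)] \frac{D_m}{n^2}. \tag{11}$$

In particular, if $t = \varphi_\lambda$, using that $D_m \le n$,

$$E(\nu_n^{(1)}(\varphi_\lambda)^2 + \nu_n^{(2)}(\varphi_\lambda)^2 + \nu_n^{(4)}(\varphi_\lambda)^2) \le 8 r_0^2\, \frac{E_\mu(\tau^2) + \mu(A) E_A(\tau^4)}{n}.$$

Last we can write $\nu_n^{(3)}(t) = (1/n) \sum_{j=1}^{l_n - 1} S_j(t) 1_{\tau \le n}$ where

$$S_j(t) = \sum_{i=1+\tau(j)}^{\tau(j+1)} (t(X_i) - \langle t, f \rangle). \tag{12}$$

We remark that, according to the Markov property, the $S_j(t)$ are independent
identically distributed and centered. Thus,

$$E(\nu_n^{(3)}(t)^2) \le \frac{1}{n^2} \sum_{j=1}^{n-1} E|S_j(t)|^2.$$

Then, we use Lemma 10 below to bound the expectation of $\nu_n^{(3)}(\varphi_\lambda)^2$:

Lemma 10 For all $m \ge 2$, $E_\mu|S_j(t)|^m \le (2\|t\|_\infty)^{m-2} \|f\|_\infty \|t\|^2 E_A(\tau^m)$.

We can then give the bound

$$E(\nu_n^{(3)}(\varphi_\lambda)^2) \le \frac{1}{n^2} \sum_{j=1}^{n-1} \|f\|_\infty \|\varphi_\lambda\|^2 E_A(\tau^2) \le \frac{\|f\|_\infty E_A(\tau^2)}{n}.$$

Finally

$$E(\nu_n^2(\varphi_\lambda)) \le \frac{4}{n} [8 r_0^2 (E_\mu(\tau^2) + \mu(A) E_A(\tau^4)) + \|f\|_\infty E_A(\tau^2)].$$

Let $C = 4[8 r_0^2 (E_\mu(\tau^2) + \mu(A) E_A(\tau^4)) + \|f\|_\infty E_A(\tau^2)]$. We obtain with (9)

$$E\|f_m - \hat f_m\|^2 \le C \frac{D_m}{n}.$$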

Second step: We do not suppose any more that $(X_n)$ has an atom.

Let us apply the Nummelin splitting technique to the chain $(X_n)$ and let

$$\gamma_n^*(t) = \frac{1}{n} \sum_{i=1}^n [\|t\|^2 - 2 t^*(X_i^*)]. \tag{13}$$

We define also

$$\hat f_m^* = \arg\min_{t \in S_m} \gamma_n^*(t). \tag{14}$$

Then the property P1 in Section 6.1 yields $E\|f - \hat f_m^*\|^2 = E\|f - \hat f_m\|^2$. The
split chain having an atom (property P3), we can use the first step to deduce
$E\|f - \hat f_m^*\|^2 \le d^2(f, S_m) + C D_m/n$. It follows that

$$E\|f - \hat f_m\|^2 \le d^2(f, S_m) + C D_m/n.$$

□ 



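Proof of Lemma 10: For all j, $E_\mu|S_j(t)|^m = E_\mu|S_1(t)|^m = E_\mu|\sum_{i=\tau+1}^{\tau(2)} \bar t(X_i)|^m$
where $\bar t = t - \langle t, f \rangle$. Thus

$$E_\mu|S_j(t)|^m = \sum_{k<l} E\Big(\Big|\sum_{i=k+1}^{l} \bar t(X_i)\Big|^m \,\Big|\, \tau = k, \tau(2) = l\Big) P(\tau = k, \tau(2) = l)$$
$$\le \sum_{k<l} (2\|t\|_\infty (l - k))^{m-2} E\Big(\Big|\sum_{i=k+1}^{l} \bar t(X_i)\Big|^2 \,\Big|\, \tau = k, \tau(2) = l\Big) P(\tau = k, \tau(2) = l)$$
$$\le \sum_{k<l} (2\|t\|_\infty)^{m-2} (l - k)^{m-1} E\Big(\sum_{i=k+1}^{l} \bar t^2(X_i) \,\Big|\, \tau = k, \tau(2) = l\Big) P(\tau = k, \tau(2) = l)$$

using the Schwarz inequality. Then, since the $X_i$ have the same distribution
under $\mu$,

$$E_\mu|S_j(t)|^m \le \sum_{k<l} (2\|t\|_\infty)^{m-2} (l - k)^m E(t^2(X_1)) P(\tau = k, \tau(2) = l)$$
$$\le \sum_{k<l} (2\|t\|_\infty)^{m-2} (l - k)^m \|f\|_\infty \|t\|^2 P(\tau = k, \tau(2) = l)$$
$$\le (2\|t\|_\infty)^{m-2} \|f\|_\infty \|t\|^2 E(|\tau(2) - \tau|^m).$$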



We conclude by using the Markov property. 



□ 



6.3 Proof of Corollary 2



According to Proposition 1, $E\|f - \hat f_m\|^2 \le d^2(f, S_m) + C D_m/n$. Then we use
Lemma 12 in Barron et al. (1999) which ensures that (for piecewise polynomials
or wavelets having a regularity larger than $\alpha - 1$ and for trigonometric
polynomials) $d^2(f, S_m) = O(D_m^{-2\alpha})$. Thus,

$$E\|f - \hat f_m\|^2 = O\Big(D_m^{-2\alpha} + \frac{D_m}{n}\Big).$$

In particular, if $D_m = \lfloor n^{1/(1+2\alpha)} \rfloor$, then $E\|f - \hat f_m\|^2 = O(n^{-2\alpha/(1+2\alpha)})$.



□ 
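
The trade-off behind this choice of dimension can be illustrated numerically. The sketch below minimizes the bound $D^{-2\alpha} + C D/n$ over integer dimensions and compares the minimizer with $n^{1/(1+2\alpha)}$. The constant `C = 1` and the function names are assumptions made only for this illustration.

```python
def risk_bound(D, n, alpha, C=1.0):
    """Bias term D^(-2 alpha) plus variance term C * D / n."""
    return D ** (-2.0 * alpha) + C * D / n


def best_dimension(n, alpha, C=1.0):
    """Integer dimension in 1..n minimizing the risk bound."""
    return min(range(1, n + 1), key=lambda D: risk_bound(D, n, alpha, C))


n, alpha = 10_000, 1.0
D_star = best_dimension(n, alpha)            # numerical minimizer
D_theory = n ** (1.0 / (1.0 + 2.0 * alpha))  # order predicted by the corollary
```

The two quantities differ only by a constant factor, which is all that the rate $n^{-2\alpha/(1+2\alpha)}$ requires.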
6.4 Proof of Theorem 3



First step: We suppose that (X n ) has an atom A. 

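Let m in $\mathcal{M}_n$. The definition of $\hat m$ yields that
$\gamma_n(\hat f_{\hat m}) + \mathrm{pen}(\hat m) \le \gamma_n(f_m) + \mathrm{pen}(m)$. This leads to

$$\|\hat f_{\hat m} - f\|^2 \le \|f_m - f\|^2 + 2\nu_n(\hat f_{\hat m} - f_m) + \mathrm{pen}(m) - \mathrm{pen}(\hat m) \tag{15}$$

where $\nu_n(t) = (1/n) \sum_{i=1}^n [t(X_i) - \langle t, f \rangle]$.

Remark 11 If t is deterministic, $\nu_n(t)$ can actually be written
$\nu_n(t) = (1/n) \sum_{i=1}^n [t(X_i) - E(t(X_i))]$.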

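We set $B(m, m') = \{t \in S_m + S_{m'}, \|t\| = 1\}$. Let us write now

$$2\nu_n(\hat f_{\hat m} - f_m) = 2\|\hat f_{\hat m} - f_m\|\, \nu_n\Big(\frac{\hat f_{\hat m} - f_m}{\|\hat f_{\hat m} - f_m\|}\Big)$$
$$\le 2\|\hat f_{\hat m} - f_m\| \sup_{t \in B(m, \hat m)} \nu_n(t) \le \frac{1}{5}\|\hat f_{\hat m} - f_m\|^2 + 5 \sup_{t \in B(m, \hat m)} \nu_n(t)^2$$

by using inequality $2xy \le \frac{1}{5}x^2 + 5y^2$. Thus,

$$2E|\nu_n(\hat f_{\hat m} - f_m)| \le \frac{1}{5} E\|\hat f_{\hat m} - f_m\|^2 + 5 E\Big(\sup_{t \in B(m, \hat m)} \nu_n(t)^2\Big). \tag{16}$$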

Consider decomposition (10) of $\nu_n(t)$ again and let

$$Z_n(t) = \frac{1}{n} \sum_{i=1+\tau(1)}^{\tau(n+1)} [t(X_i) - \langle t, f \rangle]. \tag{17}$$

Since $|\nu_n^{(3)}(t)| \le |Z_n(t)|$, we can write

$$\sup_{t \in B(m, \hat m)} \nu_n^{(3)}(t)^2 \le p(m, \hat m) + \sum_{m' \in \mathcal{M}_n} \Big[\sup_{t \in B(m, m')} Z_n(t)^2 - p(m, m')\Big]_+$$

where p(.,.) is a function specified in Proposition 12 below. Then, the
bound (11) combined with M1, (15) and (16) gives

$$E\|\hat f_{\hat m} - f\|^2 \le \|f_m - f\|^2 + \frac{1}{5} E\|\hat f_{\hat m} - f_m\|^2 + 160 r_0^2\, \frac{E_\mu(\tau^2) + \mu(A) E_A(\tau^4)}{n}$$
$$+ 20 \sum_{m' \in \mathcal{M}_n} E\Big[\sup_{t \in B(m, m')} Z_n(t)^2 - p(m, m')\Big]_+ + E(20 p(m, \hat m) + \mathrm{pen}(m) - \mathrm{pen}(\hat m)).$$

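We choose pen(m) such that $20 p(m, m') \le \mathrm{pen}(m) + \mathrm{pen}(m')$. Thus
$20 p(m, \hat m) + \mathrm{pen}(m) - \mathrm{pen}(\hat m) \le 2\mathrm{pen}(m)$. Let

$$W(m, m') = \Big[\sup_{t \in B(m, m')} Z_n^2(t) - p(m, m')\Big]_+. \tag{18}$$

We use now the inequality $\frac{1}{5}(x + y)^2 \le \frac{1}{3}x^2 + \frac{1}{2}y^2$ to deduce

$$E\|\hat f_{\hat m} - f\|^2 \le \frac{1}{3} E\|\hat f_{\hat m} - f\|^2 + \frac{3}{2}\|f_m - f\|^2 + 20 \sum_{m' \in \mathcal{M}_n} EW(m, m') + 2\mathrm{pen}(m) + 160 r_0^2\, \frac{E_\mu(\tau^2) + \mu(A) E_A(\tau^4)}{n}$$

and thus

$$E\|\hat f_{\hat m} - f\|^2 \le \frac{9}{4}\|f_m - f\|^2 + 30 \sum_{m' \in \mathcal{M}_n} EW(m, m') + 3\mathrm{pen}(m) + 240 r_0^2\, \frac{E_\mu(\tau^2) + \mu(A) E_A(\tau^4)}{n}.$$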

We need now to bound EW(m, m') to complete the proof. Proposition 12
below implies

$$EW(m, m') \le K' e^{-D_{m'}} (r_0 \vee 1)^2 K_3\, \frac{1 + K_2 \|f\|_\infty}{n}$$

where K' is a numerical constant and $K_2$, $K_3$ depend on the chain and with

$$p(m, m') = K\, \frac{\dim(S_m + S_{m'})}{n} (r_0 \vee 1)^2 K_3 (1 + K_2 \|f\|_\infty). \tag{19}$$

The notation $a \vee b$ means max(a, b).

Assumption M3 yields $\sum_{m' \in \mathcal{M}_n} e^{-D_{m'}} \le \sum_{k \ge 1} e^{-k} = 1/(e - 1)$. Thus, by
summation on m' in $\mathcal{M}_n$,

$$\sum_{m' \in \mathcal{M}_n} EW(m, m') \le \frac{K'}{e - 1} (r_0 \vee 1)^2 K_3\, \frac{1 + K_2 \|f\|_\infty}{n}.$$

It remains to specify the penalty, which has to satisfy $20 p(m, m') \le \mathrm{pen}(m) + \mathrm{pen}(m')$.
The value of p(m, m') is given by (19), so we set

$$\mathrm{pen}(m) \ge 20 K\, \frac{D_m}{n} (r_0 \vee 1)^2 K_3 (1 + K_2 \|f\|_\infty).$$
Finally

$$\forall m \quad E\|\hat f_{\hat m} - f\|^2 \le 3\|f_m - f\|^2 + 3\mathrm{pen}(m) + \frac{C_1}{n}$$

where $C_1$ depends on $r_0$, $\|f\|_\infty$, $\mu(A)$, $E_\mu(\tau^2)$, $E_A(\tau^4)$, $K_2$, $K_3$. Since it is true
for all m, we obtain the result. 
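
To illustrate how $\hat m$ is obtained from the contrast and the penalty used in this first step, here is a minimal sketch for the histogram basis on [0, 1]. It relies on the identity $\gamma_n(\hat f_m) = -\sum_\lambda \hat\beta_\lambda^2$, valid for a contrast of the form (13), and on a penalty of the form $K D_m/n$. The value `K = 1` is a hypothetical choice (the theory only gives a lower bound on the penalty constant) and the function names are illustrative.

```python
import math


def contrast(sample, D):
    """gamma_n(f_hat_m) = -sum of squared empirical coefficients, D-bin histogram basis."""
    n = len(sample)
    counts = [0] * D
    for x in sample:
        counts[min(int(x * D), D - 1)] += 1
    # beta_hat_j = sqrt(D) * counts[j] / n for phi_j = sqrt(D) * 1_[j/D, (j+1)/D)
    return -D * sum((c / n) ** 2 for c in counts)


def select_dimension(sample, K=1.0):
    """Minimize contrast + pen over D = 1..sqrt(n), with pen(D) = K * D / n (K assumed)."""
    n = len(sample)
    d_max = max(1, math.isqrt(n))  # constraint D_m <= sqrt(n) from M1
    criterion = {D: contrast(sample, D) + K * D / n for D in range(1, d_max + 1)}
    return min(criterion, key=criterion.get)


flat = [(i + 0.5) / 400 for i in range(400)]  # points spread evenly over [0, 1]
half = [(i + 0.5) / 800 for i in range(400)]  # points concentrated on [0, 0.5)
d_flat = select_dimension(flat)
d_half = select_dimension(half)
```

On the evenly spread points the criterion keeps the single-bin model, while on the points concentrated on half of the interval it selects two bins, the smallest model that captures the jump.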

Second step: We do not suppose any more that $(X_n)$ has an atom.

The Nummelin splitting technique allows us to create the chain $(X_n^*)$ and to
define $\gamma_n^*(t)$ and $\hat f_m^*$ as above by (13), (14). Set now

$$\hat m^* = \arg\min_{m \in \mathcal{M}_n} [\gamma_n^*(\hat f_m^*) + \mathrm{pen}(m)]$$

and $\tilde f^* = \hat f^*_{\hat m^*}$. The property P1 in Section 6.1 gives $E\|f - \tilde f\|^2 = E\|f - \tilde f^*\|^2$.
The split chain having an atom, we can use the first step to deduce

$$E\|f - \tilde f^*\|^2 \le 3 \inf_{m \in \mathcal{M}_n} \{d^2(f, S_m) + \mathrm{pen}(m)\} + \frac{C_1}{n}.$$

And then the result is valid when replacing $\tilde f^*$ by $\tilde f$.

□ 

Proposition 12 Let $(X_n)$ be a Markov chain which satisfies A1-A5 and $(S_m)_{m \in \mathcal{M}_n}$
be a collection of models satisfying M1-M3. We suppose that $(X_n)$ has an atom
A. Let $Z_n(t)$ and W(m, m') be defined by (17) and (18) with

$$p(m, m') = K\, \frac{\dim(S_m + S_{m'})}{n} (r_0 \vee 1)^2\, \frac{1 + \|f\|_\infty E_A(s^\tau)}{(\ln s)^2}$$

(where K is a numerical constant and s is a real depending on the chain).
Then

$$EW(m, m') \le K' e^{-D_{m'}} (r_0 \vee 1)^2\, \frac{1 + \|f\|_\infty E_A(s^\tau)}{(\ln s)^2\, n}.$$

Proof of Proposition 12: We can write $Z_n(t) = (1/n) \sum_{j=1}^n S_j(t)$ where
$S_j(t)$ is defined by (12). According to Lemma 10,
$E_\mu|S_j(t)|^m \le (2\|t\|_\infty)^{m-2} \|f\|_\infty \|t\|^2 E_A(\tau^m)$. Now, we use condition A5 of
geometric ergodicity. The proof of Theorem 15.4.2 in Meyn and Tweedie (1993)
shows that A is a Kendall set, i.e. there exists s > 1 (depending on A) such
that $\sup_{x \in A} E_x(s^\tau) < \infty$. Then $E_A(\tau^m) \le [m!/(\ln s)^m] E_A(s^\tau)$. Indeed

$$E_A(\tau^m) = \int_0^\infty m x^{m-1} P_A(\tau > x)\, dx \le \int_0^\infty m x^{m-1} s^{-x} E_A(s^\tau)\, dx = \frac{m!}{(\ln s)^m} E_A(s^\tau).$$
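
The moment bound $E_A(\tau^m) \le [m!/(\ln s)^m] E_A(s^\tau)$ can be checked on a simple example. The sketch below assumes, purely for illustration, that the return time $\tau$ is geometric with success probability `p`, so that both sides are available in closed or series form. This law and the function names are assumptions and are not taken from the paper.

```python
import math


def geometric_moment(p, m, kmax=2000):
    """E(tau^m) when P(tau = k) = p (1 - p)^(k - 1), k >= 1 (truncated series)."""
    return sum(k ** m * p * (1.0 - p) ** (k - 1) for k in range(1, kmax + 1))


def exponential_moment(p, s):
    """E(s^tau) for the same geometric law; finite only if s (1 - p) < 1."""
    if s * (1.0 - p) >= 1.0:
        raise ValueError("E(s^tau) is infinite for this value of s")
    return p * s / (1.0 - (1.0 - p) * s)


def kendall_bound(p, s, m):
    """Right-hand side m! / (ln s)^m * E(s^tau) of the moment bound."""
    return math.factorial(m) / math.log(s) ** m * exponential_moment(p, s)


p, s = 0.5, 1.5
checks = [geometric_moment(p, m) <= kendall_bound(p, s, m) for m in range(2, 7)]
```

Every entry of `checks` is true here, as expected since the bound follows from the pointwise inequality $x^m \le m!\, s^x/(\ln s)^m$.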
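Thus

$$\forall m \ge 2 \quad E_\mu|S_j(t)|^m \le m! \Big(\frac{2\|t\|_\infty}{\ln s}\Big)^{m-2} \frac{\|f\|_\infty \|t\|^2}{(\ln s)^2} E_A(s^\tau). \tag{20}$$

We use now the following inequality (see Petrov (1975) p.49):

$$P\Big(\max_{1 \le l \le n} \sum_{j=1}^{l} S_j(t) \ge y\Big) \le 2 P\Big(\sum_{j=1}^{n} S_j(t) \ge y - \sqrt{2 B_n}\Big)$$

where $B_n \ge \sum_{j=1}^n E S_j(t)^2$. The inequality (20) gives us
$B_n = 2n\, \frac{\|f\|_\infty \|t\|^2}{(\ln s)^2} E_A(s^\tau)$ and

$$P\Big(\max_{1 \le l \le n} \sum_{j=1}^{l} S_j(t) \ge y\Big) \le 2 P\Big(\sum_{j=1}^{n} S_j(t) \ge y - \frac{2 M \|t\| \sqrt n}{\ln s}\Big)$$

where $M^2 = \|f\|_\infty E_A(s^\tau)$. We use then the Bernstein inequality given by
Birgé and Massart (1998):

$$P\Big(\sum_{j=1}^{n} S_j(t) \ge n\varepsilon\Big) \le e^{-nx}$$

with $\varepsilon = \frac{2\|t\|_\infty}{\ln s}\, x + \frac{2\|t\| M}{\ln s} \sqrt{x}$. Indeed, according to (20),

$$\frac{1}{n} \sum_{j=1}^{n} E|S_j(t)|^m \le \frac{m!}{2} \Big(\frac{2\|t\|_\infty}{\ln s}\Big)^{m-2} \Big(\frac{\sqrt 2\, \|t\| M}{\ln s}\Big)^2.$$

Finally

$$P\Big(Z_n(t) \ge \frac{2}{\ln s} \Big[\|t\|_\infty x + M\|t\|\sqrt{x} + M\|t\|/\sqrt{n}\Big]\Big) \le 2 e^{-nx}. \tag{21}$$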



We will now use a chaining technique used in Barron et al. (1999). Let us
recall first the following lemma (Lemma 9 p.400 in Barron et al. (1999), see
also Proposition 1 in Birgé and Massart (1998)).

Lemma 13 Let S be a subspace of $L^2$ with dimension D spanned by $(\varphi_\lambda)_{\lambda \in \Lambda}$
(orthonormal basis). Let

$$r = \frac{1}{\sqrt D} \sup_{\beta \ne 0} \frac{\|\sum_{\lambda \in \Lambda} \beta_\lambda \varphi_\lambda\|_\infty}{\sup_{\lambda \in \Lambda} |\beta_\lambda|}.$$

Then, for all $\delta > 0$, we can find a countable set $T \subset S$ and a mapping $\pi$ from
S to T such that:

• for every ball B with radius $\sigma \ge 5\delta$,

$$|T \cap B| \le (5\sigma/\delta)^D \tag{22}$$

• $\|u - \pi(u)\| \le \delta$, $\forall u \in S$ and $\sup_{u \in \pi^{-1}(t)} \|u - t\|_\infty \le r\delta$, $\forall t \in T$.

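We apply this lemma to the subspace $S_m + S_{m'}$ with dimension $D_m \vee D_{m'}$
denoted by D(m, m') and r = r(m, m') defined by

$$r(m, m') = \frac{1}{\sqrt{D(m, m')}} \sup_{\beta \ne 0} \frac{\|\sum_{\lambda \in \Lambda(m, m')} \beta_\lambda \varphi_\lambda\|_\infty}{\sup_{\lambda \in \Lambda(m, m')} |\beta_\lambda|}$$

where $(\varphi_\lambda)_{\lambda \in \Lambda(m, m')}$ is an orthonormal basis of $S_m + S_{m'}$. Notice that this
quantity satisfies $\phi_{m''} \le r(m, m') \le \sqrt{D(m, m')}\, \phi_{m''}$ where m'' is such that
$S_m + S_{m'} = S_{m''}$ and then, using M2,

$$r(m, m') \le r_0 \sqrt{D(m, m')}.$$

We consider $\delta_0 \le 1/5$, $\delta_k = \delta_0 2^{-k}$, and the $T_k = T \cap B(m, m')$ where T is
defined by Lemma 13 with $\delta = \delta_k$ and B(m, m') is the unit ball of $S_m + S_{m'}$.
Inequality (22) gives us $|T \cap B(m, m')| \le (5/\delta_k)^{D(m, m')}$. By letting
$H_k = \ln(|T_k|)$, we obtain

$$H_k \le D(m, m') \Big[\ln\Big(\frac{5}{\delta_0}\Big) + k \ln 2\Big]. \tag{23}$$

Thus, for all u in B(m, m'), we can find a sequence $\{u_k\}_{k \ge 0}$ with $u_k \in T_k$ such
that $\|u - u_k\| \le \delta_k$ and $\|u - u_k\|_\infty \le r(m, m') \delta_k$. Hence, we have the following
decomposition:

$$u = u_0 + \sum_{k=1}^{\infty} (u_k - u_{k-1})$$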



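with $\|u_0\| \le 1$ and $\|u_0\|_\infty \le r_0 \sqrt{D(m, m')}\, \|u_0\| \le r_0 \sqrt{D(m, m')}$ and for all
$k \ge 1$,

$$\|u_k - u_{k-1}\| \le \delta_k + \delta_{k-1} = 3\delta_{k-1}/2,$$
$$\|u_k - u_{k-1}\|_\infty \le 3 r(m, m') \delta_{k-1}/2 \le 3 r_0 \sqrt{D(m, m')}\, \delta_{k-1}/2.$$

Then

$$P\Big(\sup_{u \in B(m, m')} Z_n(u) > \eta\Big) = P\Big(\exists (u_k)_{k \ge 0} \in \prod_{k \ge 0} T_k,\ Z_n(u_0) + \sum_{k=1}^{\infty} Z_n(u_k - u_{k-1}) > \eta_0 + \sum_{k=1}^{\infty} \eta_k\Big)$$
$$\le \sum_{u_0 \in T_0} P(Z_n(u_0) > \eta_0) + \sum_{k=1}^{\infty} \sum_{\substack{u_k \in T_k \\ u_{k-1} \in T_{k-1}}} P(Z_n(u_k - u_{k-1}) > \eta_k)$$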
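with $\eta_0 + \sum_{k=1}^{\infty} \eta_k \le \eta$. We use the exponential inequality (21) to obtain

$$\sum_{u_0 \in T_0} P(Z_n(u_0) > \eta_0) \le 2 e^{H_0 - n x_0},$$
$$\sum_{\substack{u_k \in T_k \\ u_{k-1} \in T_{k-1}}} P(Z_n(u_k - u_{k-1}) > \eta_k) \le 2 e^{H_k + H_{k-1} - n x_k}$$

by choosing

$$\eta_0 = \frac{2}{\ln s} \Big(r_0 \sqrt{D(m, m')}\, x_0 + M \sqrt{x_0} + \frac{M}{\sqrt n}\Big),$$
$$\eta_k = \frac{3}{\ln s} \Big(r_0 \sqrt{D(m, m')}\, \delta_{k-1} x_k + M \delta_{k-1} \sqrt{x_k} + \frac{M \delta_{k-1}}{\sqrt n}\Big).$$

Let us choose now the $(x_k)_{k \ge 0}$ such that $n x_0 = H_0 + D_{m'} + v$ and for $k \ge 1$,

$$n x_k = H_{k-1} + H_k + k D_{m'} + D_{m'} + v.$$

Thus

$$P\Big(\sup_{u \in B(m, m')} Z_n(u) > \eta\Big) \le 2 e^{-D_{m'} - v} \Big(1 + \sum_{k \ge 1} e^{-k D_{m'}}\Big) \le 3.2\, e^{-D_{m'} - v}.$$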



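It remains to bound $\sum_{k=0}^{\infty} \eta_k$:

$$\sum_{k=0}^{\infty} \eta_k \le \frac{1}{\ln s} (A_1 + A_2 + A_3)$$

where

$$A_1 = r_0 \sqrt{D(m, m')} \Big(2 x_0 + 3 \sum_{k=1}^{\infty} \delta_{k-1} x_k\Big),$$
$$A_2 = 2M \sqrt{x_0} + 3M \sum_{k=1}^{\infty} \delta_{k-1} \sqrt{x_k},$$
$$A_3 = \frac{2M}{\sqrt n} + \sum_{k=1}^{\infty} \frac{3M \delta_{k-1}}{\sqrt n}.$$

• Regarding the third term, just write

$$A_3 = \frac{M}{\sqrt n} \Big(2 + 3 \sum_{k=1}^{\infty} \delta_{k-1}\Big) = \frac{M}{\sqrt n} (6\delta_0 + 2) = c_1(\delta_0) \frac{M}{\sqrt n}$$

with $c_1(\delta_0) = 6\delta_0 + 2$.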

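• Let us bound the first term. First, recall that $D(m, m') \le \sqrt n$ and then
$\sqrt{D(m, m')}/n \le 1/\sqrt{n D(m, m')}$. Observing that $\sum_{k=1}^{\infty} \delta_{k-1} = 2\delta_0$ and
$\sum_{k=1}^{\infty} k \delta_{k-1} = 4\delta_0$ and using (23), we get

$$A_1 \le c_1(\delta_0)\, r_0\, \frac{v}{\sqrt{n D(m, m')}} + c_2(\delta_0)\, r_0 \sqrt{\frac{D(m, m')}{n}}$$

with $c_2(\delta_0) = c_1(\delta_0) + \ln(5/\delta_0)(2 + 12\delta_0) + 6\delta_0 (2 + 3 \ln 2)$.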

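• To bound the second term, we use the Schwarz inequality and the inequality
$\sqrt{a + b} \le \sqrt a + \sqrt b$. We obtain

$$A_2 \le c_1(\delta_0)\, M \sqrt{\frac{v}{n}} + c_3(\delta_0)\, M \sqrt{\frac{D(m, m')}{n}}$$

with $c_3(\delta_0) = 2\sqrt{1 + \ln(5/\delta_0)} + 3\sqrt{2\delta_0} \sqrt{6\delta_0 (1 + \ln 2) + 4\delta_0 \ln(5/\delta_0)}$.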

We get so 



— v Ins J \ . Lnfm mf\ V n 



k=o ^ 111 s ' \JnD(m,m' 



D(m,m') (r V 1\ r 

fc=0 



Ins / nD(m,m') n 
D(m, w!) (tq V 1 



2 

2 



n V Ins / 



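where

$$c_4(\delta_0) = 6 c_1^2, \qquad c_5(\delta_0) = (6/5) \sup(c_2, c_3 + c_1)^2.$$

Let us choose now $\delta_0 = 0.024$ and then $c_4 = 28$, $c_5 = 268$. Let
$K_1 = c_4 ((r_0 \vee 1)/\ln s)^2$. Then

$$\eta^2 = K_1 \Big[\frac{v^2}{n D(m, m')} \vee M^2 \frac{v}{n}\Big] + p(m, m')$$

where

$$p(m, m') = c_5 (r_0 \vee 1)^2\, \frac{D(m, m')}{n}\, \frac{1 + \|f\|_\infty E_A(s^\tau)}{(\ln s)^2}.$$

We get

$$P\Big(\sup_{u \in B(m, m')} Z_n^2(u) > K_1 \Big[\frac{v^2}{n D(m, m')} \vee M^2 \frac{v}{n}\Big] + p(m, m')\Big) = P\Big(\sup_{u \in B(m, m')} Z_n^2(u) > \eta^2\Big)$$
$$\le P\Big(\sup_{u \in B(m, m')} Z_n(u) > \eta\Big) + P\Big(\inf_{u \in B(m, m')} Z_n(u) < -\eta\Big).$$

Now

$$P\Big(\inf_{u \in B(m, m')} Z_n(u) < -\eta\Big) \le \sum_{u_0 \in T_0} P(Z_n(u_0) < -\eta_0) + \sum_{k=1}^{\infty} \sum_{\substack{u_k \in T_k \\ u_{k-1} \in T_{k-1}}} P(Z_n(u_k - u_{k-1}) < -\eta_k)$$
$$\le \sum_{u_0 \in T_0} P(Z_n(-u_0) > \eta_0) + \sum_{k=1}^{\infty} \sum_{\substack{u_k \in T_k \\ u_{k-1} \in T_{k-1}}} P(Z_n(-u_k + u_{k-1}) > \eta_k)$$
$$\le 3.2\, e^{-D_{m'} - v}.$$

Hence

$$P\Big(\sup_{u \in B(m, m')} Z_n^2(u) > K_1 \Big[\frac{v^2}{n D(m, m')} \vee M^2 \frac{v}{n}\Big] + p(m, m')\Big) \le 6.4\, e^{-D_{m'} - v}.$$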

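We obtain then

$$E\Big[\sup_{t \in B(m, m')} Z_n^2(t) - p(m, m')\Big]_+ \le \int_0^\infty P\Big(\sup_{u \in B(m, m')} Z_n^2(u) > p(m, m') + z\Big) dz$$
$$\le \int_0^{M^2 D(m, m')} P\Big(\sup_{u \in B(m, m')} Z_n^2(u) > p(m, m') + K_1 M^2 \frac{v}{n}\Big) K_1 \frac{M^2}{n}\, dv$$
$$+ \int_{M^2 D(m, m')}^{\infty} P\Big(\sup_{u \in B(m, m')} Z_n^2(u) > p(m, m') + K_1 \frac{v^2}{n D(m, m')}\Big) K_1 \frac{2v}{n D(m, m')}\, dv$$
$$\le K_1 \frac{M^2}{n} \int_0^\infty 6.4\, e^{-D_{m'} - v}\, dv + \frac{2 K_1}{n D(m, m')} \int_0^\infty 6.4\, e^{-D_{m'} - v}\, v\, dv$$
$$\le \frac{6.4 K_1}{n}\, e^{-D_{m'}} \Big(M^2 + \frac{2}{D(m, m')}\Big) \le 12.8\, K_1 e^{-D_{m'}}\, \frac{1 + M^2}{n}.$$

By replacing $M^2$ by its value, we get so

$$EW(m, m') \le K' e^{-D_{m'}} \Big(\frac{r_0 \vee 1}{\ln s}\Big)^2 \frac{1 + \|f\|_\infty E_A(s^\tau)}{n}$$

where K' is a numerical constant.

□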



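6.5 Proof of Corollary 5

According to Theorem 3, $E\|f - \tilde f\|^2 \le C_2 \inf_{m \in \mathcal{M}_n} \{d^2(f, S_m) + D_m/n\}$. Since
$d^2(f, S_m) = O(D_m^{-2\alpha})$ (see Lemma 12 in Barron et al. (1999)),

$$E\|f - \tilde f\|^2 \le C_3 \inf_{m \in \mathcal{M}_n} \Big\{D_m^{-2\alpha} + \frac{D_m}{n}\Big\}.$$

In particular, if $m_0$ is such that $D_{m_0} = \lfloor n^{1/(1+2\alpha)} \rfloor$, then

$$E\|f - \tilde f\|^2 \le C_3 \Big\{D_{m_0}^{-2\alpha} + \frac{D_{m_0}}{n}\Big\} \le C_4\, n^{-2\alpha/(1+2\alpha)}.$$

The condition $D_m \le \sqrt n$ allows this choice of m only if $\alpha > 1/2$. □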

6.6 Proof of Theorem 7

The proof is identical to the one of Theorem 3. □

6.7 Proof of Corollary 8

It is sufficient to prove that $d(g, S_m^{(2)}) \le C D_m^{-\alpha}$ if g belongs to $B^\alpha_{2,\infty}([0, 1]^2)$. It
is done in the following lemma. □

Lemma 14 Let g be in the Besov space $B^\alpha_{2,\infty}([0, 1]^2)$. We consider the following
spaces of dimension $D^2$:

• $S_1$ is a space of piecewise polynomials of degree bounded by $s > \alpha - 1$ based
on a partition into squares of side 1/D,

• $S_2$ is a space of orthonormal wavelets of regularity $s > \alpha - 1$,

• $S_3$ is the space of trigonometric polynomials.

Then, there exist positive constants $C_i$ such that

$$d(g, S_i) \le C_i D^{-\alpha} \quad \text{for } i = 1, 2, 3.$$



Proof of Lemma 14: Let us recall the definition of $B^\alpha_{2,\infty}([0, 1]^2)$. Let

$$\Delta_h^r g(x, y) = \sum_{k=0}^{r} (-1)^{r-k} \binom{r}{k} g(x + k h_1, y + k h_2)$$

be the rth difference operator with step h and $\omega_r(g, t) = \sup_{|h| \le t} \|\Delta_h^r g\|_2$ the rth
modulus of smoothness of g. We say g is in the Besov space $B^\alpha_{2,\infty}([0, 1]^2)$ if
$\sup_{t > 0} t^{-\alpha} \omega_r(g, t) < \infty$ for $r = \lfloor \alpha \rfloor + 1$, or equivalently, for r an integer larger
than $\alpha$.

DeVore (1998) proved that $d(g, S_1) \le C \omega_{s+1}(g, D^{-1})$, so

$$d(g, S_1) \le C D^{-\alpha}.$$

For the wavelets case, we use the fact that g belongs to $B^\alpha_{2,\infty}([0, 1]^2)$ if and
only if $\sup_j 2^{j\alpha} \|\beta_j\| < \infty$ (see Meyer (1990) chapter 6, section 10). If $g_D$ is the
orthogonal projection of g on $S_2$, it follows from Bernstein's inequality that

$$\|g - g_D\|^2 = \sum_{j > m} \sum_{k, l} |\beta_{jkl}|^2 \le C \sum_{j > m} 2^{-2j\alpha} \le C' D^{-2\alpha}$$

where m is such that $2^m = D$.

For the trigonometric case, it is proved in Nikol'skii (1975) (p. 191 and 200)
that $d(g, S_3) \le C \omega_{s+1}(g, D^{-1})$ so that $d(g, S_3) \le C' D^{-\alpha}$. □



6.8 Proof of Theorem 9



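Let us prove first the first item. Let $E_n = \{\|f - \tilde f\|_\infty \le \chi/2\}$ and $E_n^c$ its
complementary. On $E_n$, $\tilde f(x) = \tilde f(x) - f(x) + f(x) \ge \chi/2$ and for n large
enough, $\tilde\pi(x, y) = \tilde g(x, y)/\tilde f(x)$. For all $(x, y) \in [0, 1]^2$,

$$|\tilde\pi(x, y) - \pi(x, y)|^2 \le \Big|\frac{\tilde g(x, y) - \tilde f(x) \pi(x, y)}{\tilde f(x)}\Big|^2 1_{E_n} + (\|\tilde\pi\|_\infty + \|\pi\|_\infty)^2 1_{E_n^c}$$
$$\le \frac{|\tilde g(x, y) - g(x, y) + \pi(x, y)(f(x) - \tilde f(x))|^2}{\chi^2/4}\, 1_{E_n} + (a_n + \|\pi\|_\infty)^2 1_{E_n^c},$$

$$E\|\tilde\pi - \pi\|^2 \le \frac{8}{\chi^2} [E\|g - \tilde g\|^2 + \|\pi\|_\infty^2 E\|f - \tilde f\|^2] + (a_n + \|\pi\|_\infty)^2 P(E_n^c).$$

It remains to bound $P(E_n^c)$. To do this, we observe that

$$\|f - \tilde f\|_\infty \le \|f - f_{\hat m}\|_\infty + \|f_{\hat m} - \hat f_{\hat m}\|_\infty.$$

Let $\gamma = \alpha - \frac{1}{2}$, then $B^\alpha_{2,\infty}([0, 1]) \subset B^\gamma_{\infty,\infty}([0, 1])$ (see DeVore and Lorentz
(1993) p. 182). Thus f belongs to $B^\gamma_{\infty,\infty}([0, 1])$ and Lemma 12 in Barron et al.
(1999) gives

$$\|f - f_{\hat m}\|_\infty \le C D_{\hat m}^{-\gamma} \le C (\ln n)^{-\gamma}.$$

Thus $\|f - f_{\hat m}\|_\infty$ decreases to 0 and $\|f - f_{\hat m}\|_\infty \le \chi/4$ for n large enough. So

$$P(E_n^c) \le P\Big(\|f_{\hat m} - \hat f_{\hat m}\|_\infty > \frac{\chi}{4}\Big).$$

But $\|f_{\hat m} - \hat f_{\hat m}\|_\infty \le r_0 \sqrt{D_{\hat m}}\, \|f_{\hat m} - \hat f_{\hat m}\| \le r_0 n^{1/8} \|f_{\hat m} - \hat f_{\hat m}\|$ and
$\|f_{\hat m} - \hat f_{\hat m}\|^2 = \sum_{\lambda \in \Lambda_{\hat m}} \nu_n^2(\varphi_\lambda)$. Thus,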

< p( E ^ } (^) 2 + 4 2) (^) 2 + #W > 
+ bup E p&ivx) > ^-^) 

rri£M„ xeA m OZTqU 

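We need then to bound two terms. For the first term, let $S_{m_0}$ be the maximum
model with cardinal $D_{m_0} \le n^{1/4}$. Since $\Lambda_{\hat m} \subset \Lambda_{m_0}$ and using inequality (11)
and the assumption $\forall m\ D_m \le n^{1/4}$, we obtain

$$E\Big(\sum_{\lambda \in \Lambda_{\hat m}} [\nu_n^{(1)}(\varphi_\lambda)^2 + \nu_n^{(2)}(\varphi_\lambda)^2 + \nu_n^{(4)}(\varphi_\lambda)^2]\Big) \le C n^{-3/2}.$$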

Besides, for all x and for all $\lambda$, using (21),

$$P\Big(Z_n(\varphi_\lambda) > 2 r_0 n^{1/8} x + 2M \sqrt x + \frac{2M}{\sqrt n}\Big) \le 2 e^{-nx}$$

and so

$$P\Big(Z_n^2(\varphi_\lambda) > \Big(2 r_0 n^{1/8} x + 2M \sqrt x + \frac{2M}{\sqrt n}\Big)^2\Big) \le 4 e^{-nx}.$$

Let now $x = n^{-3/4}$; x verifies (for n large enough)

$$2 r_0 n^{3/8} x + 2M n^{1/4} \sqrt x + 2M n^{-1/4} \le \frac{\chi}{r_0 \sqrt{32}}$$

that yields

$$\Big(2 r_0 n^{1/8} x + 2M \sqrt x + \frac{2M}{\sqrt n}\Big)^2 \le \frac{\chi^2}{32 r_0^2 n^{1/2}}.$$

The previous inequality gives then

$$P\Big(Z_n^2(\varphi_\lambda) > \frac{\chi^2}{32 r_0^2 n^{1/2}}\Big) \le 4 e^{-nx} \le 4 e^{-n^{1/4}}.$$

Finally

$$P(E_n^c) \le 4 n^{1/4} e^{-n^{1/4}} + C' n^{-5/4} \le C'' n^{-5/4}$$

for n large enough. And then, for n large enough, $(a_n + \|\pi\|_\infty)^2 P(E_n^c) \le C a_n^2 n^{-5/4}$.
So, since $a_n = o(n^{1/8})$, $(a_n + \|\pi\|_\infty)^2 P(E_n^c) = o(n^{-1})$.

The remaining result in Theorem 9 follows from Corollary 5 and Corollary 8.

□